Doctor of Philosophy (PhD)
Computational Analysis and Modeling
This dissertation aims to summarize text documents by creating new sentences. In addition to generating new sentences, the project generates new concepts and extracts key sentences. It is the first research work that can both generate new key concepts and create new sentences to summarize documents.
Automatic document summarization is the process of creating a condensed version of a document that retains the key content of the original. Most related research uses statistical methods that generate a summary based on the distribution of words in the document. In this dissertation, we create a summary based on concept distributions and concept hierarchies. We use the Stanford parser as our syntax parser and ResearchCyc (Cyc) as our knowledge base. Words and phrases of a document are mapped to Cyc concepts. We introduce a unique concept propagation method to generate abstract concepts and use those abstract concepts for summarization. This method has two advantages over existing methods. One is that multi-level upward propagation addresses the word sense disambiguation problem. The other is that the propagation process provides a way to produce generalized concepts.
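The upward propagation idea can be illustrated with a minimal sketch. It assumes a toy hierarchy stored as a Python dictionary rather than Cyc itself, and the concept names, the number of levels, and the per-level `decay` factor are all hypothetical choices made for illustration. The abstract does not state how much weight is passed to each ancestor.

```python
from collections import defaultdict

# Hypothetical toy hierarchy: concept -> immediate superclasses.
# These names are illustrative and are not actual Cyc constants.
PARENTS = {
    "Dog": ["Mammal"],
    "Cat": ["Mammal"],
    "Mammal": ["Animal"],
}


def propagate(initial_weights, parents, levels=2, decay=0.5):
    """Add each concept's weight to its ancestors, up to `levels` levels.

    The contribution is multiplied by `decay` at every level. Both
    parameters are assumptions of this sketch, not values from the source.
    """
    totals = defaultdict(float)
    for concept, weight in initial_weights.items():
        totals[concept] += weight
        frontier = [concept]
        contribution = weight
        for _ in range(levels):
            contribution *= decay
            # Move one level up the hierarchy.
            frontier = [p for c in frontier for p in parents.get(c, [])]
            for parent in frontier:
                totals[parent] += contribution
    return dict(totals)


weights = propagate({"Dog": 2.0, "Cat": 1.0}, PARENTS)
print(weights)
```

In this toy run the concept `Mammal` never appears in the input, yet it accumulates weight from both `Dog` and `Cat`, which is the sense in which propagation can surface a generalized concept.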
In the first part of the project, we generate a summary by extracting key concepts and key sentences from documents. We use the Stanford parser to segment a document into sentences and to parse each sentence into words or phrases tagged with their parts of speech. We use Cyc commands to map those words and phrases to their corresponding Cyc concepts and increase the weights of those concepts. To handle word sense disambiguation and to create summarized concepts, we propagate the weights of the concepts upward along the Cyc concept hierarchy. We then extract the concepts with the highest weights as the key concepts. To extract key sentences, we weight each sentence in the document based on the weights of the concepts associated with it, and then extract the sentences with the highest weights to summarize the document.
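Key sentence extraction can be sketched as follows. The sketch assumes that a sentence's score is the plain sum of the weights of its concepts, which is one possible reading since the abstract gives no scoring formula. The function name `top_sentences`, the parameter `k`, and the sample data are hypothetical.

```python
def top_sentences(sentence_concepts, concept_weights, k=2):
    """Return the k highest-scoring sentences in original document order.

    `sentence_concepts` is a list of (sentence, concepts) pairs. The score
    of a sentence is assumed to be the sum of its concepts' weights.
    """
    scores = [
        sum(concept_weights.get(c, 0.0) for c in concepts)
        for _, concepts in sentence_concepts
    ]
    # Indices of the k best scores; ties keep their document order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentence_concepts[i][0] for i in sorted(ranked)]


# Illustrative concept weights, as they might look after propagation.
concept_weights = {"Dog": 2.0, "Cat": 1.0, "Mammal": 1.5, "Animal": 0.75}
document = [
    ("Dogs bark.", ["Dog", "Mammal"]),
    ("The weather was mild.", []),
    ("Cats purr.", ["Cat", "Mammal"]),
]
summary = top_sentences(document, concept_weights, k=2)
print(summary)
```

Returning the selected sentences in document order rather than score order is a design choice of this sketch, made so that the extract reads like the original text.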
In the second part of the project, we generate new sentences to summarize a document based on the generalized concepts. First, we extract the subject, predicate, and object from each sentence. Then, we create compatibility matrices based on the compatibility between the subjects, predicates, and objects across sentences. Two terms are considered compatible if any one of the following three conditions holds: the two terms are the same concept, one concept is the other concept's immediate superclass, or the two concepts share the same immediate superclass. From the compatibility matrices, we build compatible clusters and finally generate a new sentence for each compatible cluster. These newly generated sentences serve as a summary of the document.
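The three compatibility conditions and the clustering step can be sketched as below. The sketch again assumes a toy hierarchy in place of Cyc and handles a single list of terms (for example, the subjects of all sentences). Grouping terms into connected components of the matrix is an assumption, since the abstract does not say how clusters are derived, and all names are hypothetical.

```python
# Hypothetical toy hierarchy: concept -> immediate superclasses.
PARENTS = {
    "Dog": ["Mammal"],
    "Cat": ["Mammal"],
    "Mammal": ["Animal"],
    "Car": ["Vehicle"],
}


def compatible(a, b, parents):
    """True if a and b are the same concept, one is the other's immediate
    superclass, or they share an immediate superclass."""
    pa, pb = set(parents.get(a, ())), set(parents.get(b, ()))
    return a == b or a in pb or b in pa or bool(pa & pb)


def compatibility_matrix(terms, parents):
    """Square boolean matrix of pairwise compatibility."""
    return [[compatible(a, b, parents) for b in terms] for a in terms]


def clusters(terms, matrix):
    """Group terms into connected components of the matrix (an assumption)."""
    seen, groups = set(), []
    for start in range(len(terms)):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            i = stack.pop()
            group.append(terms[i])
            for j, ok in enumerate(matrix[i]):
                if ok and j not in seen:
                    seen.add(j)
                    stack.append(j)
        groups.append(sorted(group))
    return groups


terms = ["Dog", "Cat", "Mammal", "Car"]
matrix = compatibility_matrix(terms, PARENTS)
groups = clusters(terms, matrix)
print(groups)
```

Here `Dog` and `Cat` are compatible because they share the immediate superclass `Mammal`, and each is compatible with `Mammal` itself, so the three fall into one cluster while `Car` stays alone.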
We have implemented and tested our approaches. The test results show that our approaches are viable and have great potential for future research.
Huang, Xiaomei, "" (2009). Dissertation. 488.