Date of Award
Doctor of Philosophy (PhD)
Computational Analysis and Modeling
Automatic classification of web pages is an effective way to facilitate information retrieval on the Internet. Two major classification methods are currently used in this area: keyword-based classification and sense-based classification. Keyword-based classification has two weaknesses: a keyword can carry several different meanings, and a correct match largely depends on exactly the same keywords being used. As a result, the results of keyword-based classification are not always satisfactory. Many sense-based classification algorithms and systems have been presented, but they pay little attention to the relationships between senses. In this dissertation, we present a method that automatically classifies documents based on the meanings of words and the relationships between groups of meanings, or concepts. The classification algorithm builds on the word sense structures provided by a lexical database, which not only arranges words into groups of synonyms but also arranges these groups into hierarchies that represent the relationships between concepts.
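The idea of classifying by concepts and their hierarchy, rather than by exact keywords, can be illustrated with a minimal sketch. It is not the dissertation's algorithm: the toy hierarchy `HYPERNYM`, the synonym table `SYNONYMS`, the `decay` weighting, and the dot-product scoring are all assumptions made for illustration, standing in for a real lexical database.

```python
from collections import Counter

# Hypothetical toy hierarchy: each concept maps to its parent concept.
HYPERNYM = {
    "puppy": "dog",
    "dog": "canine",
    "canine": "animal",
    "cat": "feline",
    "feline": "animal",
    "car": "vehicle",
    "truck": "vehicle",
    "animal": None,
    "vehicle": None,
}

# Hypothetical synonym groups: surface word -> concept.
SYNONYMS = {"hound": "dog", "automobile": "car", "kitty": "cat"}


def concept_of(word):
    """Map a word to a concept in the toy hierarchy, or None if unknown."""
    word = SYNONYMS.get(word, word)
    return word if word in HYPERNYM else None


def concept_profile(words, decay=0.5):
    """Weight each concept and, with decaying weight, its ancestors."""
    profile = Counter()
    for word in words:
        concept = concept_of(word)
        weight = 1.0
        while concept is not None:
            profile[concept] += weight
            concept = HYPERNYM[concept]
            weight *= decay
    return profile


def classify(words, category_profiles):
    """Pick the category whose concept profile overlaps most with the document."""
    doc = concept_profile(words)

    def score(cat_profile):
        return sum(doc[c] * w for c, w in cat_profile.items())

    return max(category_profiles, key=lambda name: score(category_profiles[name]))


CATEGORIES = {
    "pets": concept_profile(["dog", "cat"]),
    "transport": concept_profile(["car", "truck"]),
}
```

In this sketch a document containing only "puppy" still lands in the "pets" category, even though the category was never described with that keyword, because the two share ancestor concepts in the hierarchy.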
Another problem with current classification systems is that most of them ignore the conflict between a fixed number of categories and the growing number of documents being added to the system. To address this problem, we develop a category-based clustering method that automatically extracts a new category from a category that needs to be split. A category must be split when the number of documents it contains exceeds a predefined size. Experimental results show that the semantic hierarchy classification algorithm increases classification accuracy by 13% compared to existing sense-based classification algorithms. The category-based clustering algorithm produces higher-quality clusters than existing methods that do not use category information. By combining automatic classification based on word meanings with the dynamic addition of new categories through clustering, we develop a new system to meet the current and future needs of a growing Internet.
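The splitting rule can be sketched as follows. This is only an illustration under stated assumptions, not the dissertation's clustering method: documents are represented as word lists, similarity is plain Jaccard overlap, and the names `split_if_oversized`, `category_terms`, and `max_size` are hypothetical. The category description is used as one reference point, and the document least similar to it seeds the extracted category.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of words."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def split_if_oversized(docs, category_terms, max_size):
    """Return (kept, extracted); extracted is empty when no split is needed."""
    if len(docs) <= max_size:
        return list(docs), []

    # Assumption: seed the new category with the document that is
    # least similar to the existing category description.
    seed = min(docs, key=lambda d: jaccard(d, category_terms))

    kept, extracted = [], []
    for doc in docs:
        # A document moves only if it is closer to the seed than to the category.
        if jaccard(doc, seed) > jaccard(doc, category_terms):
            extracted.append(doc)
        else:
            kept.append(doc)
    return kept, extracted


docs = [
    ["dog", "cat"],
    ["dog", "puppy"],
    ["car", "truck"],
    ["car", "engine"],
]
kept, extracted = split_if_oversized(docs, ["dog", "cat", "puppy"], max_size=3)
```

With `max_size=3`, the four documents exceed the limit, so the two vehicle-related documents are extracted into a new group while the two animal-related documents stay in the original category.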
Peng, Xiaogang, "" (2004). Dissertation. 606.