Title: Comparing Classifiers for Automatic Chinese Text Categorization
Alternative title: 中文文件自動化分類方法之比較 (A Comparison of Methods for Automatic Chinese Document Classification)
Authors: Tsay, Jyh-Jong
Wang, Jing-Doo
Keywords: Text Categorization
Term Selection
Term Clustering
naive Bayes
Rocchio
k-Nearest Neighbor
Journal/Conference: 1999 NCS Conference
Abstract: In this paper, we make an extensive comparison of three classifiers for Chinese text classification: the naive Bayes (NB) probabilistic classifier, the Rocchio linear classifier, and the k-Nearest Neighbor (kNN) classifier. Our goal is to compare their performance when they are integrated with term selection, term clustering, and instance selection methods. Our experiments use one year of CNA news articles to extract meaningful terms, one month of news articles as training data, and three days of news articles as testing data. When the dimension of the term space is high, about 90,000, the Rocchio linear classifier achieves the best average accuracy, 79.35%. This observation differs from previous research, which found that Rocchio performs relatively poorly. When the dimension is reduced to 3,600 by a combination of term selection and term clustering, kNN achieves the best average accuracy, 80.24%. We further use the Generalized Instance Set (GIS) algorithm [13] to reduce the size of the training data and hence speed up on-line classification with kNN. Experiments show that applying GIS reduces the number of training instances from 6,254 to 1,195 while improving the accuracy of kNN from 80.24% to 81.12%. The accuracy achieved by previous related research is about 78%.
Date: 2006-11-13
Category: 1999 NCS (National Computer Symposium)

Files in this item:
File: ce07ncs001999000115.pdf
Size: 759.04 kB
Format: Adobe PDF

