Title: | A Scalable Approach for Chinese Term Extraction |
Authors: | Tsay, Jyh-Jong; Wang, Jing-Doo |
Journal/Conference: | 2000 ICS Conference |
Abstract: | Term extraction helps Information Retrieval (IR) systems achieve higher retrieval precision, a capability in demand for all Internet search tools. In this paper, we develop a scalable, dictionary-free approach based on the String B-tree (SB-tree) to identify significant terms in large amounts of Chinese text. Our approach consists of four steps: (i) text information database, (ii) SB-tree construction, (iii) candidate significant term extraction, and (iv) significant term validation. Our experiment uses three years of news from the Central News Agency (CNA) as the source for extracting significant terms. The corpus contains 220,395 news articles and 80,046,457 characters. With a training corpus spanning such a long period, we not only obtain robust term statistics, i.e. term frequency and document frequency, but can also detect some events from the distribution of significant terms over time intervals of different scales. This work is a foundational step toward a text data warehouse. |
Date: | 2006-11-17T03:39:38Z |
Category: | 2000 ICS International Computer Symposium |
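The abstract names term frequency and document frequency as the statistics gathered for candidate terms. The following is a minimal sketch of those two statistics only, assuming plain character n-gram counting in memory; it is not the SB-tree method of the paper, and the function names, the thresholds `min_tf` and `min_df`, and the sample documents are all hypothetical.

```python
from collections import Counter


def ngram_statistics(documents, max_len=4):
    """Count character n-grams of length 2..max_len.

    Returns (tf, df): tf counts every occurrence of an n-gram,
    df counts the number of documents containing it.
    """
    tf = Counter()
    df = Counter()
    for text in documents:
        seen = set()  # n-grams found in this document
        for n in range(2, max_len + 1):
            for i in range(len(text) - n + 1):
                gram = text[i:i + n]
                tf[gram] += 1
                seen.add(gram)
        df.update(seen)  # each document contributes at most 1 per n-gram
    return tf, df


def candidate_terms(documents, min_tf=2, min_df=2, max_len=4):
    """Keep n-grams whose frequencies reach both (hypothetical) thresholds."""
    tf, df = ngram_statistics(documents, max_len)
    return sorted(g for g in tf if tf[g] >= min_tf and df[g] >= min_df)


# Hypothetical toy corpus; the paper's corpus is three years of CNA news.
docs = ["中央通訊社報導", "中央通訊社新聞", "新聞報導"]
tf, df = ngram_statistics(docs)
candidates = candidate_terms(docs)
```

Frequency thresholds alone keep fragments such as "央通訊" alongside real terms, which is one reason a separate validation step, like step (iv) in the abstract, is needed. Enumerating every n-gram in memory also does not scale to a corpus of tens of millions of characters, the problem the abstract addresses with a disk-based index.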
Files in this item:
File | Description | Size | Format | |
---|---|---|---|---|
ce07ics002000000199.pdf | | 1.72 MB | Adobe PDF | View/Open |