CFS-based Unigram Language Model, a Brave New Approach to Traditional N-gram Language Models

Lin, Yih-Jeng; Yu, Ming-Shing

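Title:	CFS-based Unigram Language Model, a Brave New Approach to Traditional N-gram Language Models
Other title:	A unigram language model based on Chinese frequent strings, a new method superior to traditional language models
Authors:	Lin, Yih-Jeng; Yu, Ming-Shing
Keywords:	Chinese frequent strings; Chinese toneless phoneme-to-character; Chinese spelling
Journal/conference name:	2003 National Computer Symposium (NCS), Republic of China (中華民國92年全國計算機會議)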
Abstract:	This paper introduces a new concept, the Chinese frequent string (CFS) based unigram language model, which is in many respects superior to the traditional language model (LM). It presents important properties of CFSs and their applications in Chinese natural language processing (NLP). We propose a methodology for extracting Chinese frequent strings, which include unknown words, from a Chinese corpus. We found that CFSs contain many character 4-grams, word 3-grams, and higher-order n-grams. A traditional language model can derive such information only from an extremely large corpus. In contrast, using CFSs with a small training corpus, we achieve high precision and efficiency in Chinese toneless phoneme-to-character conversion and Chinese spelling error correction: 92.86% accuracy for toneless phoneme-to-character conversion and 87.32% for spelling error correction. For comparison, we also solved the two problems using a traditional lexicon, the ASCED (Academia Sinica Chinese Electronic Dictionary) provided by Academia Sinica, Taiwan, together with a word bigram language model; this achieved accuracies of 66.9% and 80.95%, respectively.
Date:	2006-06-06T02:03:37Z
Category:	2003 NCS (National Computer Symposium)
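
The abstract does not spell out how a CFS-based unigram model decodes a toneless syllable sequence, so the following is only a minimal sketch of one generic way to do it: a dynamic-programming search for the segmentation whose strings have the highest product of unigram probabilities. Everything specific here is an assumption made for illustration, not taken from the paper: the pinyin-style syllable spellings, the toy table `CFS_COUNTS` with its made-up counts, the names `build_log_probs` and `convert`, and the `max_len` limit.

```python
import math
from typing import Dict, List, Optional, Tuple

# Hypothetical toy table (not from the paper): toneless syllable sequence
# -> candidate strings with made-up frequency counts.
CFS_COUNTS: Dict[Tuple[str, ...], Dict[str, int]] = {
    ("zhong",): {"中": 50, "種": 30},
    ("wen",): {"文": 40, "問": 35},
    ("zi",): {"字": 45, "子": 60},
    ("chuan",): {"串": 10, "傳": 25},
    ("zhong", "wen"): {"中文": 20},
    ("zi", "chuan"): {"字串": 8},
}


def build_log_probs(
    counts: Dict[Tuple[str, ...], Dict[str, int]]
) -> Dict[Tuple[str, ...], Dict[str, float]]:
    """Turn raw counts into unigram log-probabilities over all strings."""
    total = sum(c for cands in counts.values() for c in cands.values())
    return {
        key: {s: math.log(c / total) for s, c in cands.items()}
        for key, cands in counts.items()
    }


def convert(
    syllables: List[str],
    log_probs: Dict[Tuple[str, ...], Dict[str, float]],
    max_len: int = 4,
) -> Optional[str]:
    """Return the highest-scoring character string, or None if no cover exists."""
    n = len(syllables)
    # best[i] holds (log-probability, text) of the best cover of syllables[:i].
    best: List[Tuple[float, str]] = [(-math.inf, "")] * (n + 1)
    best[0] = (0.0, "")
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            prev_score, prev_text = best[start]
            if prev_score == -math.inf:
                continue  # no valid cover ends at `start`
            cands = log_probs.get(tuple(syllables[start:end]))
            if not cands:
                continue
            for text, lp in cands.items():
                score = prev_score + lp
                if score > best[end][0]:
                    best[end] = (score, prev_text + text)
    return best[n][1] if best[n][0] > -math.inf else None


LOG_PROBS = build_log_probs(CFS_COUNTS)
result = convert(["zhong", "wen", "zi", "chuan"], LOG_PROBS)
```

With this toy table, `result` is `中文字串`: covering the input with two longer strings multiplies only two probabilities, so it outscores any cover built from four single characters. This is one plausible reading of why longer frequent strings help a unigram model, not a claim about the authors' implementation.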

Files in this item:

File	Description	Size	Format
WS_023200380.pdf		196.42 kB	Adobe PDF



逢甲大學校園典藏知識庫 (Feng Chia University institutional repository)