Title: | Semi-Structured Information Extraction Applying Automatic Pattern Discovery |
Authors: | Chang, Chia-Hui; Lui, Shao-Chen; Wu, Yen-Chin |
Journal/Conference name: | 2000 ICS Conference |
Abstract: | Information extraction (IE) from semi-structured Web documents is a critical issue for information integration systems on the Internet. Previous work in wrapper induction, for example WIEN, Stalker, and Softmealy, aims to solve this problem by applying machine learning to generate extractors automatically. However, this approach still requires human intervention to provide training examples. Hence, another track of information extraction research tries to save human effort. For example, Embley et al. and Chang et al. present different approaches to record boundary identification in a single Web page without any training examples. Embley's work relies on the intra-page structure constructed by HTML tags (the parse tree), while Chang's work is motivated by repeated patterns formed by multiple aligned records. This paper extends Chang's work to IE and discusses the issues that arise when applying pattern discovery to record identification, including the encoding schemes for HTML and the ranking criteria for patterns used to extract record boundaries. |
Date: | 2006-10-26T03:15:50Z |
Category: | 2000 ICS International Computer Symposium |
Files in this item:
File | Description | Size | Format | |
---|---|---|---|---|
ce07ics002000000052.pdf | | 221.78 kB | Adobe PDF | View/Open |