Title: Semi-Structured Information Extraction Applying Automatic Pattern Discovery
Authors: Chang, Chia-Hui
Lui, Shao-Chen
Wu, Yen-Chin
Journal/Conference: 2000 ICS Conference
Abstract: Information extraction (IE) from semi-structured Web documents is a critical issue for information integration systems on the Internet. Previous work in wrapper induction, such as WIEN, Stalker, and Softmealy, aims to solve this problem by applying machine learning to generate extractors automatically. However, this approach still requires human intervention to provide training examples. Hence, another track of information extraction research tries to save human effort. For example, Embley et al. and Chang et al. present different approaches to record boundary identification in a single Web page without any training examples. Embley's work relies on the intra-page structure constructed by HTML tags (the parse tree), while Chang's work is motivated by repeated patterns formed by multiple aligned records. This paper extends Chang's work to IE and discusses the issues that arise when applying pattern discovery to record identification, including the encoding schemes for HTML and the ranking criteria for the patterns used to extract record boundaries.
Date: 2006-10-26T03:15:50Z
Category: 2000 ICS International Computer Symposium

Files in this item:
File: ce07ics002000000052.pdf
Size: 221.78 kB
Format: Adobe PDF

