Title: | Automatic Extraction of Information Blocks Using PAT Trees |
Authors: | Chang, Chia-Hui; Hsu, Chun-Nan |
Journal/Conference Name: | 1999 NCS Conference |
Abstract: | Information extraction from semi-structured Web documents is a critical issue for software agents on the Internet. Previous work in wrapper induction aims to solve this problem by applying machine learning to generate extractors automatically, but this approach still requires human intervention to provide training examples. In this paper, we present a novel approach that extracts information blocks without training examples, using a data structure called a PAT tree. PAT trees allow the system to efficiently recognize repeated patterns in a semi-structured Web page. From these repeated patterns, information blocks can be located using domain-independent selection criteria. The entire system runs automatically, without any human intervention. Experimental results show that our approach performs well, with a recall rate near 90 percent on a wide range of output pages from popular search engines. |
Date: | 2006-11-13 |
Category: | 1999 NCS (National Computer Symposium) |
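
The abstract above describes recognizing repeated patterns in a page as the first step toward locating information blocks. The following is a minimal sketch of that idea only, not the authors' implementation. It swaps the PAT tree for a plain sorted list of suffixes, which exposes the same repeated prefixes but far less efficiently. The function names (`tag_tokens`, `repeated_patterns`), the `TEXT` placeholder token, and the `min_len` threshold are all assumptions made for illustration. The selection criteria mentioned in the abstract are not specified in this record, so the sketch stops at listing candidate patterns.

```python
import re


def tag_tokens(html):
    """Reduce a page to a token sequence: one token per tag, 'TEXT' per text run."""
    tokens = []
    for tag, text in re.findall(r"<\s*(/?\w+)[^>]*>|([^<]+)", html):
        if tag:
            tokens.append("<" + tag.lower() + ">")
        elif text.strip():
            tokens.append("TEXT")
    return tokens


def common_prefix_len(a, b):
    """Length of the longest common prefix of two token lists."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def repeated_patterns(tokens, min_len=3):
    """Map each repeated token run (length >= min_len) to its occurrence count.

    Assumption: a sorted suffix list stands in for the PAT tree. Sorting puts
    suffixes that share a prefix next to each other, so every shared prefix of
    two neighbours is a pattern that occurs at least twice.
    """
    suffixes = sorted(tokens[i:] for i in range(len(tokens)))
    candidates = set()
    for prev, cur in zip(suffixes, suffixes[1:]):
        k = common_prefix_len(prev, cur)
        if k >= min_len:
            candidates.add(tuple(cur[:k]))

    # Count occurrences by direct scanning; overlapping occurrences are counted.
    counts = {}
    for pat in candidates:
        k = len(pat)
        counts[pat] = sum(
            1 for i in range(len(tokens) - k + 1) if tuple(tokens[i:i + k]) == pat
        )
    return counts
```

For a result list such as `<ul><li><a href='1'>A</a></li><li><a href='2'>B</a></li><li><a href='3'>C</a></li></ul>`, the run `<li> <a> TEXT </a> </li>` is reported with three occurrences, which corresponds to one record per list item. The scan is quadratic in the page length, so it is suitable only for small inputs.
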
Files in this item:
File | Description | Size | Format | |
---|---|---|---|---|
ce07ncs001999000117.pdf | | 852.95 kB | Adobe PDF | View/Open |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.