Title: | A Streamlined Approach for Tabular Information Extraction |
Authors: | Wang, H.L.; Hsu, W.L.; Chen, Y.S.; Lau, T.L.; Tang, C.H. |
Journal/Conference: | 1999 NCS Conference |
Abstract: | In this paper, we propose a streamlined approach for extracting information from tables in HTML format. Our approach is based on a set of semantic templates associated with knowledge representation maps. We apply an abstract model to the templates to support the extraction of the tabular logical structure in stages. The abstract model comprises category identification, reading path construction, and record collection. In this model, we use an abstract table to separate the logical structure from the physical layout, and for each table we try to extract the abstract table from its physical layout. Our approach has three stages. In the first stage, we use semantic tagging templates to identify all possible categories of the cells in the table. In the second stage, we construct the reading paths with a cell linking algorithm. In the final stage, we perform a reverse traversal of the reading paths to extract and collect records from the table. We have implemented a prototype tabular logical structure extraction system in the MS-Windows environment. The prototype provides an interface through which users can input a table in HTML format, and another interface that outputs the abstract table of the input table. We have run experiments with our system on several tables with distinct layout styles. Our experimental results show that the prototype can extract the logical structure of these tables with high precision and recall. |
Date: | 2006-11-13T01:31:57Z |
Category: | 1999 NCS National Computer Symposium |
Files in this item:
File | Description | Size | Format |
---|---|---|---|
ce07ncs001999000116.pdf | | 579.52 kB | Adobe PDF |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.