A Streamlined Approach for Tabular Information Extraction

Wang, H.L.; Hsu, W.L.; Chen, Y.S.; Lau, T.L.; Tang, C.H.

Full metadata record

DC Field	Value	Language
dc.contributor.author	Wang, H.L.
dc.contributor.author	Hsu, W.L.
dc.contributor.author	Chen, Y.S.
dc.contributor.author	Lau, T.L.
dc.contributor.author	Tang, C.H.
dc.date.accessioned	2009-06-02T07:23:32Z
dc.date.accessioned	2020-05-29T06:18:13Z	-
dc.date.available	2009-06-02T07:23:32Z
dc.date.available	2020-05-29T06:18:13Z	-
dc.date.issued	2006-11-13T01:31:57Z
dc.date.submitted	1999-12-20
dc.identifier.uri	http://dspace.fcu.edu.tw/handle/2377/3124	-
dc.description.abstract	In this paper, we propose a streamlined approach for extracting information from tables in HTML format. Our approach is based on a set of semantic templates associated with knowledge representation maps. We apply an abstract model to the templates to support the staged extraction of a table's logical structure. The model comprises category identification, reading path construction, and record collection, and it uses an abstract table to separate the logical structure from the physical layout. For each table, we try to extract the abstract table from its physical layout in three stages. In the first stage, we use semantic tagging templates to identify all possible categories of the cells in the table. In the second stage, we construct the reading paths with a cell linking algorithm. In the final stage, we perform a reverse traversal of the reading paths to extract and collect records from the table. We have implemented a prototype tabular logical structure extraction system in the MS-Windows environment. The prototype provides an interface through which users can input a table in HTML format, and another interface that outputs the abstract table of the input table. We ran experiments with our system on several tables with distinct layout styles. The experimental results show that the prototype can extract the logical structure of these tables with high precision and recall.
dc.description.sponsorship	Tamkang University, Taipei County
dc.format.extent	8p.
dc.format.extent	587670 bytes
dc.format.mimetype	application/pdf
dc.language.iso	zh_TW
dc.relation.ispartofseries	1999 NCS Conference
dc.subject.other	Information Extraction and Data Mining
dc.title	A Streamlined Approach for Tabular Information Extraction
Collection:	1999 NCS National Computer Symposium
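
The three stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the regular-expression templates in `TEMPLATES`, the function names, and the rule that links each cell to the nearest header cell above it in the same column are all assumptions chosen for illustration, and the sketch handles only simple tables with column headers.

```python
import re
from html.parser import HTMLParser


class TableGrid(HTMLParser):
    """Collect the text of <td>/<th> cells into a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:          # tolerate a cell without an enclosing <tr>
                self.rows.append([])
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


# Hypothetical tagging templates: category name -> pattern (an assumption).
TEMPLATES = {
    "attribute": re.compile(r"^(name|price|city)$", re.I),
    "price_value": re.compile(r"^\$?\d+(\.\d+)?$"),
}


def tag_cells(rows):
    """Stage 1: give each cell the set of categories whose template matches."""
    return [[{cat for cat, pat in TEMPLATES.items() if pat.search(text)}
             for text in row] for row in rows]


def build_paths(tags):
    """Stage 2: link each non-header cell to the nearest header cell above it."""
    links = {}
    for r, row in enumerate(tags):
        for c, cats in enumerate(row):
            if "attribute" in cats:
                continue
            for up in range(r - 1, -1, -1):
                if c < len(tags[up]) and "attribute" in tags[up][c]:
                    links[(r, c)] = (up, c)
                    break
    return links


def collect_records(rows, links):
    """Stage 3: follow links back from data cells to headers, one record per row."""
    records = {}
    for (r, c), (hr, hc) in links.items():
        records.setdefault(r, {})[rows[hr][hc]] = rows[r][c]
    return [records[r] for r in sorted(records)]


def extract_records(html):
    """Run the three stages on one HTML table and return its records."""
    grid = TableGrid()
    grid.feed(html)
    return collect_records(grid.rows, build_paths(tag_cells(grid.rows)))
```

Calling `extract_records` on a table whose first row holds the headers `Name` and `Price` yields one dictionary per data row, keyed by header text. A real system would need richer templates and a linking rule that also covers row headers and spanning cells.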

Files in this item:

File	Description	Size	Format
ce07ncs001999000116.pdf		579.52 kB	Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.

Feng Chia University Institutional Repository