|
Publication
|
Research Title |
Bottom-up region extractor for semi-structured web pages |
Date of Distribution |
31 July 2014 |
Conference |
Title of the Conference |
18th Computer Science and Engineering Conference (ICSEC2014) |
Organiser |
Computer Science, Faculty of Science, Khon Kaen University |
Conference Place |
HOTEL PULLMAN KHON KAEN RAJA ORCHID |
Province/State |
Khon Kaen |
Conference Date |
30 July 2014 |
To |
1 August 2014 |
Proceeding Paper |
Volume |
2014 |
Issue |
1 |
Page |
284 - 289 |
Editors/edition/publisher |
IEEE |
Abstract |
Database-backed websites generally provide interfaces that give users access to their structured data. These data are usually presented as data records in a coherent region of a result page. However, the page usually contains not only the data region but also other, extraneous regions. The key tasks in extracting data records from these semi-structured web pages are therefore identifying the relevant data regions and ignoring the irrelevant ones. To address this problem, this paper proposes a region extractor that serves as a preprocessing tool, helping an information extractor locate and extract the relevant data records from web pages. Most existing works analyze the DOM tree of an input page in a top-down manner. In contrast, the proposed method traverses the DOM tree bottom-up: the similarity of the leaf nodes is analyzed first to find a set of data items. Their parent nodes are then recursively analyzed to identify data records and data regions, respectively. The proposed method is a completely unsupervised, maintenance-free wrapper. For performance evaluation, it is empirically tested on 15 real-world websites. Experiments show that the proposed method achieves 94.37% accuracy in data record extraction and outperforms the well-known top-down method DEPTA (55.39%). |
Author |
|
Peer Review Status |
Reviewed by independent reviewers |
Level of Conference |
International |
Type of Proceeding |
Full paper |
Type of Presentation |
Oral |
Part of thesis |
true |
Presentation awarding |
false |
Attach file |
|
Citation |
0
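Illustrative sketch
The bottom-up idea described in the abstract can be illustrated with a short Python sketch. This is not the authors' implementation. It assumes a well-formed page parsed with the standard library, and it substitutes exact structural equality for the similarity measure that the paper applies to leaf and parent nodes. The names `signature`, `find_data_regions`, `min_records`, and the sample `PAGE` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical result page: one data region (the list) plus extraneous regions.
PAGE = """
<html><body>
  <div id="nav"><a>Home</a></div>
  <ul id="results">
    <li><b>Title A</b><i>10</i></li>
    <li><b>Title B</b><i>20</i></li>
    <li><b>Title C</b><i>30</i></li>
  </ul>
  <div id="footer"><span>Contact</span></div>
</body></html>
"""


def signature(node):
    """Structural signature: the node's tag plus its children's signatures."""
    return (node.tag, tuple(signature(child) for child in node))


def find_data_regions(root, min_records=2):
    """Return (region, records) pairs found by a bottom-up (post-order) walk.

    Assumption: a node is a data region when at least `min_records` of its
    non-leaf children share an identical signature. Those children are
    treated as data records, and their leaves as data items.
    """
    regions = []

    def visit(node):
        # Children are examined before their parent, i.e. bottom-up.
        for child in node:
            visit(child)
        counts = Counter(signature(c) for c in node if len(c) > 0)
        if not counts:
            return
        best, n = counts.most_common(1)[0]
        if n >= min_records:
            records = [c for c in node if len(c) > 0 and signature(c) == best]
            regions.append((node, records))

    visit(root)
    return regions


regions = find_data_regions(ET.fromstring(PAGE.strip()))
for region, records in regions:
    print(region.get("id"), [[item.text for item in rec] for rec in records])
```

On the sample page, only the result list qualifies as a data region. The navigation and footer blocks are ignored because their structures do not repeat.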