|
|
Publication
|
| Research Title |
Bottom-up region extractor for semi-structured web pages |
| Date of Distribution |
31 July 2014 |
| Conference |
| Title of the Conference |
18th Computer Science and Engineering Conference (ICSEC2014) |
| Organiser |
Computer Science, Faculty of Science, Khon Kaen University |
| Conference Place |
HOTEL PULLMAN KHON KAEN RAJA ORCHID |
| Province/State |
Khon Kaen |
| Conference Date |
30 July 2014 |
| To |
1 August 2014 |
| Proceeding Paper |
| Volume |
2014 |
| Issue |
1 |
| Page |
284 - 289 |
| Editors/edition/publisher |
IEEE |
| Abstract |
Database-backed websites generally provide interfaces that give users access to their structured data. These data are usually presented as data records within a coherent region of a result page. However, the page usually contains not only the data region but also other, extraneous regions. The key tasks in extracting data records from such semi-structured web pages are therefore identifying the relevant data regions and ignoring the irrelevant ones. To address this problem, this paper proposes a region extractor that serves as a preprocessing tool, helping an information extractor locate and extract the relevant data records from web pages. Most existing approaches analyze the DOM tree of an input page in a top-down manner. In contrast, the proposed method traverses the DOM tree bottom-up: the similarity of the leaf nodes is analyzed first to find a set of data items, and their parent nodes are then analyzed recursively to identify data records and data regions, respectively. The proposed method is a completely unsupervised, maintenance-free wrapper. For performance evaluation, it was tested empirically on 15 real-world websites. Experiments show that the proposed method achieves 94.37% accuracy in data record extraction and outperforms the well-known top-down method DEPTA (55.39%). |
| Author |
|
| Peer Review Status |
Independently peer reviewed
| Level of Conference |
International
| Type of Proceeding |
Full paper |
| Type of Presentation |
Oral |
| Part of thesis |
true |
| Used for graduation |
No
| Presentation awarding |
false |
| Attach file |
|
| Citation |
0
|
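
The bottom-up idea described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: the record above does not give its similarity measure or thresholds, so the sketch assumes exact structural equality in place of similarity, and the names `signature`, `find_regions`, `min_records`, and the sample `PAGE` are hypothetical. It visits children before parents, treats repeated sibling subtrees as candidate data records, and reports their parent as a candidate data region.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, well-formed sample page: one navigation block, one result list, one footer.
PAGE = """
<html><body>
  <div id="nav"><a>Home</a><span>About</span></div>
  <ul id="results">
    <li><b>Item A</b><i>10</i></li>
    <li><b>Item B</b><i>20</i></li>
    <li><b>Item C</b><i>30</i></li>
  </ul>
  <p>Footer</p>
</body></html>
"""

def signature(node):
    """Structural signature: the node's tag plus the signatures of its children.

    Assumption: exact equality of signatures stands in for the similarity
    analysis mentioned in the abstract.
    """
    return (node.tag, tuple(signature(child) for child in node))

def find_regions(root, min_records=2):
    """Return (region_node, record_nodes) pairs, visiting children before parents."""
    regions = []

    def visit(node):
        for child in node:
            visit(child)  # bottom-up: descendants are handled first
        counts = Counter(signature(child) for child in node)
        if not counts:
            return  # leaf node: nothing to group
        sig, n = counts.most_common(1)[0]
        # Repeated children that themselves contain data items are candidate records.
        if n >= min_records and sig[1]:
            records = [child for child in node if signature(child) == sig]
            regions.append((node, records))

    visit(root)
    return regions

for region, records in find_regions(ET.fromstring(PAGE.strip())):
    print(region.get("id"), [rec.find("b").text for rec in records])
# prints: results ['Item A', 'Item B', 'Item C']
```

On this sample, the navigation block and footer are ignored because their children do not repeat, while the three structurally identical list items are grouped as records of a single region. Real result pages are rarely this regular, which is why an approximate similarity measure would be needed in practice.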