|
Publication
|
Title of Article |
Data Regions Extraction for Semi-Structured Web Pages Using Bottom-up Approach |
Date of Acceptance |
26 December 2014 |
Journal |
Title of Journal |
KKU RESEARCH JOURNAL (GRADUATE STUDIES) |
Standard |
TCI |
Institute of Journal |
Graduate School, Khon Kaen University |
ISBN/ISSN |
1906-201X |
Volume |
14 |
Issue |
4 |
Month |
December |
Year of Publication |
2014 |
Page |
1-16 |
Abstract |
In this paper, we propose an unsupervised information extraction system called Bottom-up Wrapper (BUW) that automatically extracts data regions from semi-structured web pages such as search result pages and product catalog pages. Although the data records in a semi-structured web page are generated from backend databases and encoded into HTML by server-side scripts using fixed templates, they are presented without structural information. Moreover, the increasing complexity of websites makes it difficult to automatically identify the correct data region and extract the relevant data records. Many existing techniques use a top-down approach that identifies the data regions first and then the data records and data items. In contrast, we address the problem bottom-up: we first analyze the repetitive patterns of data items, which are then used to identify the relevant data records and data regions. The technique is a completely unsupervised, maintenance-free wrapper. For performance evaluation, it was tested empirically on real-world websites. The results show that the proposed technique is robust and in many cases outperforms existing wrappers such as RSP and SDE (based on DEPTA). |
Keyword |
Information extraction; Bottom-up approach; Semi-structured web pages |
Author |
|
Reviewing Status |
Independently peer-reviewed |
Status |
Published |
Level of Publication |
National |
citation |
true |
Part of thesis |
true |
Attach file |
|
Citation |
0
|
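The bottom-up idea summarized in the abstract (start from repeated data items, then work upward to data records and the data region) can be illustrated with a small sketch. This is not the BUW algorithm from the paper, whose details are not given in this record. It is a simplified stand-in that assumes well-formed HTML and treats every non-empty text node as a data item. The names `ItemCollector`, `find_data_region`, and `min_repeats` are hypothetical.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag; skipped to keep the stack balanced.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ItemCollector(HTMLParser):
    """Records each text node with its path of (tag, sibling index) pairs."""

    def __init__(self):
        super().__init__()
        self.stack = []          # open elements as (tag, index among siblings)
        self.child_counts = [0]  # children seen so far at each open level
        self.items = []          # (indexed path, text) for every text node

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        index = self.child_counts[-1]
        self.child_counts[-1] += 1
        self.stack.append((tag, index))
        self.child_counts.append(0)

    def handle_endtag(self, tag):
        # Assumption: input is well formed, so the top of the stack matches.
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()
            self.child_counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append((tuple(self.stack), text))


def find_data_region(html, min_repeats=3):
    """Return (region tag path, record tag) for the most repeated items, or None."""
    collector = ItemCollector()
    collector.feed(html)
    collector.close()

    # Step 1 (items): group text nodes by tag-only path, ignoring sibling indices.
    by_signature = defaultdict(list)
    for path, _text in collector.items:
        signature = tuple(tag for tag, _index in path)
        by_signature[signature].append(path)

    # Step 2 (records, regions): for each repeated signature, the first depth at
    # which the indexed paths diverge is the record level; its parent is the region.
    votes = Counter()
    for signature, paths in by_signature.items():
        if len(paths) < min_repeats:
            continue
        depth = 0
        while depth < len(signature) and all(p[depth] == paths[0][depth] for p in paths):
            depth += 1
        if depth == len(signature):
            continue  # all text nodes sit in the same element; no repetition
        votes[(paths[0][:depth], signature[depth])] += len(paths)

    if not votes:
        return None
    (region, record_tag), _count = votes.most_common(1)[0]
    return "/".join(tag for tag, _index in region), record_tag


SAMPLE = """<html><body>
<div><h1>Shop</h1><p>Welcome</p></div>
<ul>
<li><b>Pen</b><span>$1</span></li>
<li><b>Ink</b><span>$2</span></li>
<li><b>Pad</b><span>$3</span></li>
</ul>
</body></html>"""

print(find_data_region(SAMPLE))
```

On the sample page, the repeated item paths under `li` vote for the `ul` element as the data region and `li` as the record tag, while the one-off heading and paragraph are ignored. A real wrapper would also need to handle malformed markup, nested regions, and records that span several sibling elements.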