Title of Article Data Regions Extraction for Semi-Structured Web Pages Using Bottom-up Approach 
Date of Acceptance 26 December 2014 
     Institute of Journal Graduate School, Khon Kaen University 
     ISBN/ISSN 1906-201X 
     Volume 14 
     Month December
     Year of Publication 2014 
     Page 1-16 
     Abstract In this paper, we propose an unsupervised information extraction system called Bottom-up Wrapper (BUW) for automatically extracting data regions from semi-structured web pages such as search result pages and product catalog pages. Although the data records in a semi-structured web page are generated from backend databases and encoded into HTML with fixed templates by server-side scripts, these data records are presented without structural information. Moreover, the increasing complexity of websites makes it difficult to automatically identify the correct data region and extract the relevant data records. Many existing techniques use a top-down approach that identifies the data regions first and then the data records and data items. In contrast, we address the problem in a bottom-up way that starts by analyzing the repetitive patterns of data items, which can then be used to identify the relevant data records and data regions. The technique is a completely unsupervised, maintenance-free wrapper. For performance evaluation, it is empirically tested on real-world websites. The results show that the proposed technique is robust and in many cases outperforms existing wrappers such as RSP and SDE (based on DEPTA).
     Keyword Information extraction (การสกัดสารสนเทศ); Bottom-up approach (กระบวนการแบบล่างขึ้นบน); Semi-structured web pages (หน้าเว็บแบบกึ่งโครงสร้าง)
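
The bottom-up idea described in the abstract can be illustrated with a minimal sketch. This is not the BUW algorithm itself, whose details are not given in this record; it only assumes that data items are text nodes, that a "repetitive pattern" is a tag path shared by several items, and that the data region is the deepest node containing all of those items. The names `ItemCollector`, `find_data_region`, and `min_repeats` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ItemCollector(HTMLParser):
    """Record every non-empty text node (a data item) with its path from the root."""

    def __init__(self):
        super().__init__()
        self.stack = []    # open elements as (tag, unique node id)
        self.next_id = 0
        self.items = []    # (path of (tag, id) pairs, text)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append((tag, self.next_id))
        self.next_id += 1

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag name, if there is one.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append((tuple(self.stack), text))


def find_data_region(html, min_repeats=3):
    """Return (region tag path, list of records), or None if nothing repeats."""
    parser = ItemCollector()
    parser.feed(html)
    parser.close()

    # Step 1 (items): find the tag path shared by the most data items.
    signatures = Counter(tuple(t for t, _ in path) for path, _ in parser.items)
    if not signatures:
        return None
    signature, count = signatures.most_common(1)[0]
    if count < min_repeats:
        return None

    # Step 2 (region): climb to the deepest node shared by all repeated items.
    repeated = [p for p, _ in parser.items if tuple(t for t, _ in p) == signature]
    prefix = []
    for nodes in zip(*repeated):
        if len(set(nodes)) != 1:
            break
        prefix.append(nodes[0])
    depth = len(prefix)

    # Step 3 (records): group every item in the region by the region's child.
    records = {}
    for path, text in parser.items:
        if len(path) > depth and list(path[:depth]) == prefix:
            records.setdefault(path[depth], []).append(text)
    return [t for t, _ in prefix], list(records.values())


SAMPLE = """<html><body><h1>Results</h1><ul>
<li><a>Item A</a><span>$1</span></li>
<li><a>Item B</a><span>$2</span></li>
<li><a>Item C</a><span>$3</span></li>
</ul><p>Footer</p></body></html>"""

region, records = find_data_region(SAMPLE)
print(region)   # ['html', 'body', 'ul']
print(records)  # [['Item A', '$1'], ['Item B', '$2'], ['Item C', '$3']]
```

The sketch works from the items upward: it never guesses where the region is, and the heading and footer are excluded only because they sit outside the node that encloses the repeated items. A real wrapper would need a more tolerant notion of similarity than exact tag-path equality.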
