             Publication
Journal Publication
Title of Article Information extraction for deep web using repetitive subject pattern 
Date of Acceptance 23 July 2013 
Journal
     Title of Journal World Wide Web 
     Standard  
     Institute of Journal Springer US 
     ISBN/ISSN 1573-1413 
     Volume  
     Issue  
     Month
     Year of Publication 2013 
     Page 1-31 
     Abstract In this paper, we propose an information extraction (IE) system that extracts data records from semi-structured documents on the Deep Web using a proposed technique called Repetitive Subject Pattern. The technique is based on the hypothesis that every data record in a web page has a subject item, and that the repetitive pattern of the subject items can be used to identify the boundaries of the data records. The system consists of four automatic tasks: (1) parsing a sample page into a DOM tree, (2) recognizing a subject string in the DOM tree, (3) using the subject string to identify the pattern of data records and generate a wrapper, and (4) using the generated wrapper to extract data records. This approach makes the wrapper generator flexible: when the automatic process generates a wrong wrapper, the user can provide a new sample subject string to generate a better one. As a result, the system can operate in both semi-supervised and unsupervised modes. Experiments show that the proposed technique generates high-quality wrappers, with both recall and precision close to 100% when tested on a number of datasets. 
     Keyword Information extraction, Web data extraction, Web content mining, Subject pattern, Wrapper induction, Unsupervised learning 
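
The four tasks listed in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the semi-supervised mode, in which the user supplies a sample subject string, and it swaps in a much simpler rule than the paper's pattern analysis: the "wrapper" is just the tag path of the subject node, and each record is assumed to be the parent element of a subject node. The names Node, TreeBuilder, parse, generate_wrapper and extract_records are hypothetical, and the input is assumed to be well-formed HTML without void elements such as br.

```python
from html.parser import HTMLParser


class Node:
    """A very small DOM node: tag name, parent, children, and own text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Task 1: parse a sample page into a tree (assumes well-formed HTML)."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    return builder.root


def walk(node):
    """Yield the node and all of its descendants in document order."""
    yield node
    for child in node.children:
        yield from walk(child)


def tag_path(node):
    """Return the path of tag names from below the root to this node."""
    parts = []
    while node.parent is not None:
        parts.append(node.tag)
        node = node.parent
    return "/".join(reversed(parts))


def generate_wrapper(root, subject):
    """Tasks 2 and 3 (simplified): find the sample subject string and
    use its tag path as the wrapper. Returns None if it is not found."""
    for node in walk(root):
        if node.text == subject:
            return tag_path(node)
    return None


def extract_records(root, wrapper):
    """Task 4 (simplified): every node with the wrapper's tag path is a
    subject item; its parent element is taken as the record boundary."""
    records = []
    for node in walk(root):
        if tag_path(node) == wrapper:
            record = node.parent
            records.append([n.text for n in walk(record) if n.text])
    return records


page = "<ul><li><b>Book A</b><i>10</i></li><li><b>Book B</b><i>12</i></li></ul>"
tree = parse(page)
wrapper = generate_wrapper(tree, "Book A")
records = extract_records(tree, wrapper)
```

In this toy page the subject "Book A" sits at the tag path ul/li/b, and the same path repeats once per record, so both list items are recovered. Real pages would need the more robust boundary detection that the paper describes.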
Author
537020029-1 Mr. WACHIRAWUT THAMVISET [Main Author]
Science Doctoral Degree

Reviewing Status Reviewed by independent reviewers 
Status Published 
Level of Publication International 
citation false 
Part of thesis true 
Attach file
Citation 1