GSMIS

Publication

Journal Publication

Title of Article

Information extraction for deep web using repetitive subject pattern

Date of Acceptance

23 July 2013

Journal

Title of Journal

World Wide Web

Standard

Institute of Journal

Springer US

ISBN/ISSN

1573-1413

Volume

Issue

Month

Year of Publication

2013

Page

1-31

Abstract

In this paper, we propose an information extraction (IE) system for extracting data records from semi-structured documents on the Deep Web using a proposed technique called Repetitive Subject Pattern. The technique is based on the hypothesis that each data record in a web page must have a subject item, and that the repetitive pattern of the subject items can be used to identify the boundaries of data records. The system consists of four automatic tasks: (1) parsing a sample page into a DOM tree, (2) recognizing a subject string in the DOM tree, (3) using the subject string to identify the pattern of data records and generate a wrapper, and (4) using the generated wrapper to extract data records. This approach makes the wrapper generator very flexible: when the automatic process generates a wrong wrapper, the user can provide a new sample subject string to generate a better one. As a result, the system can operate in both semi-supervised and unsupervised modes. Experiments show that the proposed technique generates very high-quality wrappers, with both recall and precision close to 100% when tested on a number of datasets.
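
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the subject string is supplied by the user (the semi-supervised mode), it represents the wrapper simply as the tag path of the subject item, and it replaces the DOM tree with a flat list of (tag path, text) pairs. The names `PathCollector`, `flatten`, `generate_wrapper`, and `extract_records` are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Flatten a page into (tag_path, text) pairs in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.items = []   # (tag_path, text) for each non-empty text node

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def flatten(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.items


def generate_wrapper(sample_html, subject):
    """Assumed wrapper form: the tag path of the sample subject string."""
    for path, text in flatten(sample_html):
        if text == subject:
            return path
    raise ValueError(f"subject string not found: {subject!r}")


def extract_records(html, wrapper):
    """Start a new record at every text node whose path matches the wrapper."""
    records = []
    for path, text in flatten(html):
        if path == wrapper:
            records.append([text])        # a subject item opens a record
        elif records:
            records[-1].append(text)      # other items join the open record
    return records


if __name__ == "__main__":
    page = (
        "<html><body><h1>Books</h1><ul>"
        "<li><a>Dune</a><span>Herbert</span></li>"
        "<li><a>Emma</a><span>Austen</span></li>"
        "</ul></body></html>"
    )
    wrapper = generate_wrapper(page, "Dune")
    print(wrapper)
    print(extract_records(page, wrapper))
```

In this simplification every repetition of the subject's tag path marks the start of a new record, so text that follows the last record on the page is attached to it. A fuller treatment would also need to bound each record within the DOM tree.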

Keyword

Information extraction, Web data extraction, Web content mining, Subject pattern, Wrapper induction, Unsupervised learning

Author

537020029-1	Mr. WACHIRAWUT THAMVISET [Main Author]
	Science Doctoral Degree

Reviewing Status

Independently peer-reviewed

Status

Published

Level of Publication

International

citation

false

Part of thesis

true

Used for graduation

No
