GSMIS

Publication information

Publication as an academic journal article

Article title

Information extraction for deep web using repetitive subject pattern

Date accepted (day/month/year)

23 July 2556 B.E. (2013)

Journal

Journal name

World Wide Web

Journal standard

Organization that owns the journal

Springer US

ISBN/ISSN

1573-1413

Volume

Issue

Month

Year of publication (B.E.)

2556

Pages

1-31

Abstract

In this paper, we propose an information extraction (IE) system that extracts data records from semi-structured documents on the Deep Web using a technique called Repetitive Subject Pattern. The technique is based on the hypothesis that each data record in a web page must have a subject item, and that the repetitive pattern of the subject items can be used to identify the boundaries of data records. The system consists of four automatic tasks: (1) parsing a sample page into a DOM tree, (2) recognizing a subject string in the DOM tree, (3) using the subject string to identify the pattern of data records and generate a wrapper, and (4) using the generated wrapper to extract data records. This approach makes the wrapper generator very flexible: when the automatic process generates a wrong wrapper, the user can provide a new sample subject string to generate a better one. As a result, the system can operate in both semi-supervised and unsupervised modes. Experiments show that the proposed technique generates very high quality wrappers, with both recall and precision close to 100% when tested on a number of datasets.
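
The four tasks listed in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm: it assumes well-formed markup in which every tag is closed, it takes the subject string from the user (the semi-supervised mode) rather than recognizing it automatically, and it treats the tag path of the subject node as the "wrapper". The names `Node`, `TreeBuilder`, `generate_wrapper`, and `extract_records` are hypothetical and chosen only for this illustration.

```python
from html.parser import HTMLParser


class Node:
    """A very small DOM node (illustrative, not a full DOM)."""

    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children, self.text = tag, parent, [], ""
        if parent is not None:
            parent.children.append(self)

    def path(self):
        # Tag path from just below the root, e.g. ("ul", "li", "b").
        node, tags = self, []
        while node.parent is not None:
            tags.append(node.tag)
            node = node.parent
        return tuple(reversed(tags))

    def walk(self):
        # Depth-first traversal in document order.
        yield self
        for child in self.children:
            yield from child.walk()


class TreeBuilder(HTMLParser):
    """Task 1: parse a page into a tree (assumes every tag is closed)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        self.current = Node(tag, self.current)

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def generate_wrapper(root, subject):
    # Tasks 2-3 (simplified): find the sample subject string and use the
    # tag path of its node as the wrapper.
    for node in root.walk():
        if node.text == subject:
            return node.path()
    raise ValueError("subject string not found in page")


def extract_records(root, wrapper):
    # Task 4: every node with the wrapper's tag path is a subject item.
    # Each record is the largest subtree containing exactly one subject item.
    def subject_count(node):
        return sum(1 for n in node.walk() if n.path() == wrapper)

    records = []
    for subject in [n for n in root.walk() if n.path() == wrapper]:
        record = subject
        while record.parent is not None and subject_count(record.parent) == 1:
            record = record.parent
        records.append([n.text for n in record.walk() if n.text])
    return records


page = ("<ul><li><b>Book A</b><i>10 USD</i></li>"
        "<li><b>Book B</b><i>12 USD</i></li></ul>")
root = parse(page)
wrapper = generate_wrapper(root, "Book A")
records = extract_records(root, wrapper)
print(wrapper)
print(records)
```

In this toy page, the repeated `b` elements act as subject items, so each enclosing `li` becomes one record boundary. Supplying a different sample subject string yields a different wrapper, which mirrors the correction step described in the abstract.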

Keywords

Information extraction, Web data extraction, Web content mining, Subject pattern, Wrapper induction, Unsupervised learning

Authors

537020029-1	Mr. วชิราวุธ ธรรมวิเศษ [Main author]
	Faculty of Science, Doctoral degree, English

Article review

Reviewed by independent referees

Publication status

Published

Journal distribution level

International

citation

None

Part of a thesis

Yes

Used for graduation

No

Attached file

Citation