|
Publication
|
Research Title |
Structured web information extraction using repetitive subject pattern |
Date of Distribution |
18 May 2012 |
Conference |
Title of the Conference |
9th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON 2012) |
Organiser |
Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI) Association, Thailand. |
Conference Place |
Novotel Hua Hin - Cha Am Beach Resort & Spa |
Province/State |
Phetchaburi, Thailand |
Conference Date |
16 May 2012 |
To |
18 May 2012 |
Proceeding Paper |
Volume |
2012 |
Issue |
1 |
Page |
1-4 |
Editors/edition/publisher |
IEEE Xplore |
Abstract |
Data records on a dynamic web page are often generated from databases by server-side scripts using fixed templates or layouts. Generally, each data record on the web page has a subject item that can be used to identify it. This paper reports a novel semi-supervised information extraction system that requires end users to supply only one subject item from a sample data record. The system then builds a wrapper and extracts the relevant data records automatically. The proposed system relies on three techniques: a repetitive subject pattern for discovering data records, a subject tree clustering algorithm for clustering target data records, and a subject tree alignment for aligning data items and creating an extraction pattern. For performance evaluation, the proposed system was tested empirically on twelve popular real-world websites in both Thai and English. It achieved 100 percent accuracy for correctly extracted records. In addition, the proposed system is more user-friendly than other similar systems. |
Author |
|
Peer Review Status |
Independently peer reviewed |
Level of Conference |
International |
Type of Proceeding |
Full paper |
Type of Presentation |
Oral |
Part of thesis |
true |
Presentation awarding |
false |
Attach file |
|
Citation |
2
|