Web data extraction and alignment tools: A survey |
Author(s): |
Pranali Nikam, DYPIET Pimpri Pune; Vidhya Ghogare, DYPIET Pimpri Pune; Yogita Gote, DYPIET Pimpri Pune; Jyothi Rapalli, DYPIET Pimpri Pune |
Keywords: |
Combining Tag And Value Similarity (CTVS), Query Result Record (QRR), Data extraction and label assignment for web database (DeLa), Data Extraction Based on Partial Tree Alignment (DEPTA), Visual Perception based Extraction of Records (ViPER), Visual information and Tag Structure based wrapper generator (ViNTS) |
Abstract |
Data extraction from web pages is the process of analyzing data sources (usually unstructured or poorly structured) and retrieving relevant data from them in a specific pattern for further processing; it also involves adding metadata and data integration details for later stages of the data workflow. This survey gives an overview of different web data extraction and data alignment techniques. The extraction techniques covered are DeLa, DEPTA, ViPER, and ViNTS. The data alignment techniques covered are pairwise QRR alignment, holistic alignment, and nested structure processing. Web databases generate query result pages in response to a user's query. Automatically extracting the data from these query result pages is very important for many applications, such as data integration, that need to cooperate with multiple web databases. A new method for data extraction is proposed that combines both tag and value similarity. It automatically extracts data from query result pages by first identifying and segmenting the query result records (QRRs) in the query result pages and then aligning the segmented QRRs into a table in which the data values from the same attribute are put into the same column. The data region identification method identifies noncontiguous QRRs that have the same parent according to their tag similarities. Specifically, we propose new techniques to handle the case when the QRRs are not contiguous, which may be due to the presence of auxiliary information such as a comment, recommendation, or advertisement, and to handle any nested structure that may exist in the QRRs. |
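The two steps named in the abstract, identifying noncontiguous QRRs under one parent by tag similarity and then aligning them into a table, can be illustrated with a minimal sketch. This is not the authors' method: the similarity measure (`difflib.SequenceMatcher`), the 0.8 threshold, the positional alignment that stands in for pairwise and holistic alignment, and all function names and sample data are assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def tag_similarity(tags_a, tags_b):
    """Ratio in [0, 1] comparing two tag sequences (illustrative measure only)."""
    return SequenceMatcher(None, tags_a, tags_b).ratio()


def find_records(siblings, threshold=0.8):
    """Return the largest group of sibling nodes with similar tag sequences.

    Siblings do not have to be contiguous, so an interleaved advertisement
    or comment ends up in its own small group and is left out.
    """
    groups = []
    for node in siblings:
        for group in groups:
            # Compare against the first member of each existing group.
            if tag_similarity(node["tags"], group[0]["tags"]) >= threshold:
                group.append(node)
                break
        else:
            groups.append([node])
    return max(groups, key=len)


def align(records):
    """Put the i-th value of every record in column i (naive positional alignment)."""
    width = max(len(r["values"]) for r in records)
    return [r["values"] + [None] * (width - len(r["values"])) for r in records]


# Hypothetical children of one parent node in a query result page.
siblings = [
    {"tags": ["div", "a", "span", "span"], "values": ["Book A", "Smith", "$10"]},
    {"tags": ["div", "img"], "values": ["Sponsored"]},  # auxiliary information
    {"tags": ["div", "a", "span", "span"], "values": ["Book B", "Jones", "$12"]},
]

table = align(find_records(siblings))
for row in table:
    print(row)
```

A method that also uses value similarity would compare the data values themselves (for example their data types or string similarity) when deciding which column a value belongs to; the positional rule here is only a placeholder for that step.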
Other Details |
Paper ID: IJSRDV3I1650; Published in: Volume 3, Issue 1; Publication Date: 01/04/2015; Page(s): 1496-1501 |