Web data extraction and alignment tools: A survey

Author(s):

Pranali Nikam, DYPIET Pimpri Pune; Vidhya Ghogare, DYPIET Pimpri Pune; Yogita Gote, DYPIET Pimpri Pune; Jyothi Rapalli, DYPIET Pimpri Pune

Keywords:

Combining Tag And Value Similarity (CTVS), Query Result Record (QRR), Data extraction and label assignment for web database (DeLa), Data Extraction Based on Partial Tree Alignment (DEPTA), Visual Perception based Extraction of Records (ViPER), Visual information and Tag Structure based wrapper generator (ViNTS)

Abstract

Data extraction from web pages is the process of analyzing and retrieving relevant data from data sources (usually unstructured or poorly structured) in a specific pattern for further processing; it also involves adding metadata and data integration details for later stages of the data workflow. This survey gives an overview of the different web data extraction and data alignment techniques. The extraction techniques covered are DeLa, DEPTA, ViPER, and ViNTS. The data alignment techniques are pairwise QRR alignment, holistic alignment, and nested structure processing. Query result pages are generated by a web database in response to a user's query. Automatically extracting the data from these query result pages is very important for many applications, such as data integration, that need to cooperate with multiple web databases. A new data extraction method is proposed that combines both tag and value similarity. It automatically extracts data from query result pages by first identifying and segmenting the query result records (QRRs) in the query result pages and then aligning the segmented QRRs into a table in which the data values from the same attribute are put into the same column. The data region identification method identifies noncontiguous QRRs that have the same parent according to their tag similarities. Specifically, we propose new techniques to handle the case in which the QRRs are not contiguous, which may be due to the presence of auxiliary information such as a comment, recommendation, or advertisement, and to handle any nested structure that may exist in the QRRs.
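The idea of finding noncontiguous QRRs under one parent by tag similarity can be illustrated with a minimal sketch. It is not the method from the paper: the function names `tag_similarity` and `find_records`, the representation of each child node as a flat list of tag names, the use of `difflib.SequenceMatcher` as the similarity measure, and the threshold of 0.8 are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def tag_similarity(a, b):
    """Similarity in [0, 1] between two tag sequences (assumed measure)."""
    return SequenceMatcher(None, a, b).ratio()


def find_records(children, threshold=0.8):
    """Return indices of the largest group of children with similar tags.

    `children` holds one tag sequence per child of a common parent node.
    Children that do not resemble the group (for example an advertisement
    sitting between two records) are left out, so the returned records
    need not be contiguous.
    """
    best = []
    for seed in children:
        group = [j for j, other in enumerate(children)
                 if tag_similarity(seed, other) >= threshold]
        if len(group) > len(best):
            best = group
    return best


# Hypothetical children of one parent node in a query result page.
children = [
    ["div", "a", "span", "span"],  # record
    ["div", "a", "span", "span"],  # record
    ["div", "img"],                # auxiliary information (advertisement)
    ["div", "a", "span", "span"],  # record
]
records = find_records(children)
print(records)  # [0, 1, 3]
```

The advertisement at index 2 shares only the `div` tag with the records, so it falls below the threshold and is skipped, while the record after it is still collected. Aligning the values of the selected records into columns would be a separate step.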

Other Details

Paper ID: IJSRDV3I1650
Published in: Volume : 3, Issue : 1
Publication Date: 01/04/2015
Page(s): 1496-1501
