Web Spider Design for Data Warehousing
Author(s):
Prof. Namrata Yesansure, PRIYADARSHINI COLLEGE OF ENGINEERING; Gayatri Machhirke, PRIYADARSHINI COLLEGE OF ENGINEERING; Priya Meshram, PRIYADARSHINI COLLEGE OF ENGINEERING; Priyanka Kale, PRIYADARSHINI COLLEGE OF ENGINEERING; Yugandhara Thak, PRIYADARSHINI COLLEGE OF ENGINEERING
Keywords:
Data Warehousing, Web Spider
Abstract
The content of the web has increasingly become a focus of academic research. Large-scale processing of web pages requires computer programs, and at some stage a web crawler is needed to fetch the pages to be analyzed. Processing the text of web pages to extract information can be expensive in terms of processor time. A data warehouse maintains large, heterogeneous data from various input sources, and websites are one of them; selecting particular data from a website is a tedious and time-consuming job. Searching a website for particular data requires a great deal of browsing and manual effort to copy the data from web pages, and the user has to download images from each page manually. The main objective is to develop an application, based on a web crawling algorithm, that fetches data matching user-supplied keywords from a given target website and saves it in a database. The application is built on Java Swing and the Crawler4j API. Crawler4j is an open-source web crawler for Java that provides a simple interface for crawling the web. MySQL is used for the database. Web crawlers are mainly used to create a copy of all visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches. People use information from websites for research, publications, advertising, marketing, and so on. Often this data comes from databanks or data warehouses. In short, many people perform web data mining to gather information. To reduce the manual effort this involves, software called a "Web Crawler" is applied, which uses various kinds of algorithms to achieve the goal. These algorithms use various heuristic functions to increase the efficiency of crawlers. In this paper, we present a new crawling system designed using the Java Crawler4j API. The system is a platform that allows the user to crawl a given website to find relevant data.
The system is also integrated with report functionality, which provides an option to access the crawled data at a later time. We design five different types of crawler to suit different user needs. Once a crawler starts, it shows its progress in a process window, and any data found is stored in the keyword table. Running test cases on different domains, we found that the results are acceptable from the user's point of view and that our aims are achieved.
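The keyword-driven crawl described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not use the Crawler4j API: it is plain Java, the class name `KeywordCrawlSketch` and all method names are hypothetical, an in-memory map of URL to HTML stands in for real page fetching, and the returned list stands in for rows written to the MySQL keyword table. It assumes absolute links only and matches keywords against raw HTML.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration: breadth-first keyword crawl restricted to one website.
final class KeywordCrawlSketch {

    // Matches href="..." values; a deliberately simple stand-in for a real HTML parser.
    private static final Pattern HREF =
            Pattern.compile("href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // True when the URL's host equals the target host, so the crawl stays on the given site.
    static boolean sameSite(String url, String targetHost) {
        try {
            String host = URI.create(url).getHost();
            return host != null && host.equalsIgnoreCase(targetHost);
        } catch (IllegalArgumentException e) {
            return false; // malformed link: skip it
        }
    }

    // Collects every href value found in the page, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Case-insensitive substring match of the keyword against the page content.
    static boolean containsKeyword(String text, String keyword) {
        return text.toLowerCase(Locale.ROOT).contains(keyword.toLowerCase(Locale.ROOT));
    }

    // Visits pages breadth-first from the seed and returns the URLs whose content
    // contains the keyword. `site` (URL -> HTML) is an assumed stand-in for HTTP fetches.
    static List<String> crawl(Map<String, String> site, String seed, String keyword) {
        String host = URI.create(seed).getHost();
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> hits = new ArrayList<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            String html = site.get(url);
            if (html == null) {
                continue; // page not available in the in-memory store
            }
            if (containsKeyword(html, keyword)) {
                hits.add(url); // a real system would insert a database row here
            }
            for (String link : extractLinks(html)) {
                if (sameSite(link, host) && seen.add(link)) {
                    frontier.add(link);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, String> site = new LinkedHashMap<>();
        site.put("http://example.com/",
                "<a href=\"http://example.com/a\">A</a> <a href=\"http://other.org/x\">X</a> Home");
        site.put("http://example.com/a", "Data Warehouse notes");
        site.put("http://other.org/x", "warehouse on another site");
        System.out.println(crawl(site, "http://example.com/", "warehouse"));
    }
}
```

In a Crawler4j-based design, the same-site check and the keyword check would typically live in the crawler's page-filter and page-visit callbacks, with the hit list replaced by database inserts; the sketch keeps them as plain static methods so the control flow is easy to follow.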
Other Details
Paper ID: IJSRDV6I20347
Published in: Volume 6, Issue 2
Publication Date: 01/05/2018
Page(s): 725-727
