Use of Web crawlers in automating maintenance tasks

A Web crawler is a computer program that browses the World Wide Web in an automated, methodical manner. Instead of the term Web crawler, you may also see other terms such as automatic indexer, worm, ant, bot, or Web spider.

A Web crawler is a type of software agent or bot. It starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs still to visit, which is called the crawl frontier.
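
The loop just described, take a URL from the frontier, fetch the page, extract its hyperlinks, and append the new URLs back to the frontier, can be sketched in a few lines of Python. The sketch below uses only the standard library; the seed URL is hypothetical, and politeness concerns such as robots.txt and rate limiting are deliberately left out.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href value of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seeds, max_pages=50):
        """Visit pages starting from the seed URLs, expanding the crawl frontier."""
        frontier = deque(seeds)      # the crawl frontier: URLs still to visit
        visited = set()              # URLs already fetched, to avoid loops

        while frontier and len(visited) < max_pages:
            url = frontier.popleft()
            if url in visited:
                continue
            try:
                with urlopen(url, timeout=10) as response:
                    html = response.read().decode("utf-8", errors="replace")
            except (OSError, ValueError):
                continue             # skip pages that cannot be fetched
            visited.add(url)

            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)     # resolve relative links
                if absolute not in visited:
                    frontier.append(absolute)     # new hyperlinks join the frontier
        return visited

    # Hypothetical seed list; replace with real URLs before running.
    pages = crawl(["https://example.com/"])

Using a deque gives first-in, first-out order, so this frontier is explored breadth-first, which is the behaviour most introductory crawlers assume.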

Importance of Web crawlers

This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to offer fast searches.
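
As a rough illustration of how those downloaded copies make fast searches possible, the sketch below assumes a dictionary mapping each visited URL to its downloaded text (for example, pages gathered by a crawl such as the one above) and builds a small in-memory inverted index, mapping each word to the URLs that contain it.

    import re
    from collections import defaultdict

    def build_index(pages):
        """pages: dict mapping URL -> downloaded HTML text (assumed already crawled).
        Returns an inverted index mapping each word to the set of URLs containing it."""
        index = defaultdict(set)
        for url, text in pages.items():
            # Strip tags crudely and split into lowercase word tokens.
            plain = re.sub(r"<[^>]+>", " ", text)
            for word in re.findall(r"[a-z0-9]+", plain.lower()):
                index[word].add(url)
        return index

    def search(index, word):
        """Fast lookup: return the URLs whose stored copy contains the word."""
        return index.get(word.lower(), set())

A real search engine index is far more elaborate (ranking, phrase queries, persistence), but the principle is the same: the crawler's copies are processed once so that later queries are cheap.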

Crawlers can also be used to automate maintenance tasks on a web site, such as validating HTML code or checking links. They can likewise capture specific types of information from web pages, such as harvesting e-mail addresses (mainly for spam).
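
For the link-checking case, a minimal maintenance crawler only needs to fetch one page, resolve every href it finds, and try to fetch each target. The sketch below does that with the standard library; the page URL is hypothetical, and a real checker would also respect robots.txt and avoid re-checking the same target twice.

    from html.parser import HTMLParser
    from urllib.error import HTTPError, URLError
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class HrefCollector(HTMLParser):
        """Collects href targets so each link on the page can be verified."""
        def __init__(self):
            super().__init__()
            self.hrefs = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.hrefs.extend(v for k, v in attrs if k == "href" and v)

    def check_links(page_url):
        """Fetch one page and report any hyperlink that does not resolve."""
        with urlopen(page_url, timeout=10) as response:
            html = response.read().decode("utf-8", errors="replace")
        collector = HrefCollector()
        collector.feed(html)

        broken = []
        for href in collector.hrefs:
            target = urljoin(page_url, href)
            try:
                urlopen(target, timeout=10).close()
            except (HTTPError, URLError, ValueError):
                broken.append(target)    # e.g. 404 Not Found or unreachable host
        return broken

    # Hypothetical page; substitute a page from your own site.
    print(check_links("https://example.com/index.html"))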

Policies of a crawler

Several characteristics of the Web make the crawling process difficult:

Its large volume
Its fast rate of change
Dynamic page generation

Together, these characteristics produce a very large number of possible crawlable URLs, so a crawler has to decide which of them are worth visiting. Even so, a Web crawler remains very useful on a web site.
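
One common way a crawler copes with this explosion of URLs, especially the duplicates produced by dynamic page generation, is to normalize every URL before adding it to the frontier. The rules below are an assumed example policy, not a universal standard: lowercase the scheme and host, drop the fragment, and sort query parameters so that equivalent URLs collapse to a single frontier entry.

    from urllib.parse import urlparse, urlencode, parse_qsl, urlunparse

    def normalize(url):
        """Reduce trivially different URLs to a single canonical form
        before they are added to the crawl frontier."""
        parts = urlparse(url)
        query = urlencode(sorted(parse_qsl(parts.query)))   # stable parameter order
        path = parts.path or "/"
        return urlunparse((parts.scheme.lower(),
                           parts.netloc.lower(),
                           path,
                           "",       # drop path parameters
                           query,
                           ""))      # drop the #fragment

    # Both variants normalize to the same frontier entry.
    print(normalize("https://Example.com/page?b=2&a=1#top"))
    print(normalize("https://example.com/page?a=1&b=2"))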