Web crawler: Useful for automating maintenance tasks

A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. Other terms for Web crawlers include automatic indexers, ants, bots, worms, and Web spiders.

A Web crawler is one type of software agent, or bot. It starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs still to visit, known as the crawl frontier.
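The seed-and-frontier loop described above can be sketched in a few lines of Python. This is a minimal, offline illustration: the `get_links` callable stands in for fetching a page and extracting its hyperlinks, and the toy link graph is invented for the example.

```python
from collections import deque

def crawl(seeds, get_links, max_pages=100):
    """Breadth-first crawl sketch. `get_links(url)` is a stand-in
    for downloading a page and extracting its hyperlinks."""
    frontier = deque(seeds)  # the crawl frontier: URLs still to visit
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in get_links(url):
            if link not in visited:
                frontier.append(link)  # newly discovered URLs join the frontier
    return visited

# Toy link graph standing in for real Web pages.
graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
print(sorted(crawl(["a"], lambda u: graph.get(u, []))))  # ['a', 'b', 'c']
```

A real crawler would replace `get_links` with an HTTP fetch plus HTML parsing, but the frontier bookkeeping stays the same.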

Uses of Web crawlers

This process is known as Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Search engines use Web crawlers to make a copy of all the visited pages for later processing by an indexer, which indexes the downloaded pages to provide fast searches.

Crawlers can also be used to automate maintenance tasks on a Web site, such as checking links or validating HTML code. In addition, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam).
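Both indexing and link checking depend on the same basic step: extracting the hyperlinks from a fetched page. A minimal sketch using Python's standard-library `html.parser` module (the sample page is made up):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags -- the per-page step
    a crawler or link checker performs on each downloaded page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = '<p><a href="/about">About</a> and <a href="https://example.com">more</a></p>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # ['/about', 'https://example.com']
```

A link checker would then resolve each relative URL against the page's address and request it, reporting any that return errors.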

Crawling policies

Several important characteristics of the Web make the crawling process difficult:

•    its fast rate of change,
•    its large volume, and
•    dynamic page generation.

Together, these characteristics produce a very wide variety of possible crawlable URLs.

The behavior of a Web crawler is the result of a combination of policies:

•    a selection policy that states which pages to download,
•    a re-visit policy that states when to check for changes to the pages,
•    a politeness policy that states how to avoid overloading Web sites, and
•    a parallelization policy that states how to coordinate distributed Web crawlers.

In short, a Web crawler is a very useful tool on a Web site, both for keeping search data up to date and for automating maintenance tasks.