Abstract:
A web crawler is a program that systematically browses the Internet and downloads web content. It automatically detects and organizes resources from the web according to user needs, and it is an essential component of a search engine. A search engine's crawler browses webpages, gathers links from across the Internet, and evaluates each page's significance using metrics such as the number of pages that link to it. Search engines maintain records of the webpages their crawlers have visited and indexed; a website's pages will not appear in search results if they are not indexed. Within a short period of time, a crawler contacts millions of websites, consuming a sizable amount of network, storage, and memory resources, so web crawlers have become increasingly prominent over time. While earlier research focused on scalability and robustness, several gaps remain in this area, including intersecting sub-problems, limited scalability, growing runtime and delayed network loading, low load-balancing rates, and poor fault tolerance.
In this paper, the major focus is on coping with failure, with attention to deployment and fault tolerance across internal and external links. Only a small number of papers explain the programmatic approach to crawling and its different phases, and then only briefly. Here, we address the knowledge underlying the implementation of web crawling in order to make it easier for the audience to understand.
Keywords: web crawler, search engines, web pages, web spider.