Apr 12, 2013

Google’s Crawling Process De-Mystified


Google crawls the web with the help of a program known as Googlebot. The program follows an algorithmic process to determine which sites to crawl, when to crawl them, and how often.

The crawling process starts from a single URL; from there, the bot follows the internal and external links it encounters. On each URL, Googlebot processes meta tags, image sources, hyperlinks, and ALT attributes. However, Google acknowledges that its crawler may not be able to process rich media files or dynamic pages.
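The link-following step can be sketched in a few lines of Python. This is only an illustration of the general idea, not Google's actual implementation: the names `PageParser`, `crawl`, and `FAKE_WEB` are made up for this example, and the "web" is a small in-memory dictionary so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects the elements the article says the crawler reads."""

    def __init__(self):
        super().__init__()
        self.links = []   # href values of <a> tags
        self.images = []  # (src, alt) pairs of <img> tags
        self.meta = {}    # name -> content of <meta> tags

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img":
            self.images.append((attrs.get("src"), attrs.get("alt")))
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"]] = attrs.get("content")


def crawl(seed_url, fetch, max_pages=10):
    """Breadth-first crawl from seed_url; fetch(url) returns HTML or None."""
    queue = deque([seed_url])
    seen = {seed_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page could not be retrieved, move on
        visited.append(url)
        parser = PageParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical in-memory "web" (an assumption made for this example).
FAKE_WEB = {
    "http://example.com/": '<a href="/about">About</a> <a href="http://other.example/">Other</a>',
    "http://example.com/about": '<img src="logo.png" alt="Logo"> <a href="/">Home</a>',
    "http://other.example/": '<meta name="description" content="Another site">',
}

order = crawl("http://example.com/", FAKE_WEB.get)
```

Starting from one seed URL, the sketch visits the internal link (`/about`) and the external link (`http://other.example/`) exactly once each. The `seen` set stops the crawler from looping when pages link back to each other.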

The frequency of Google’s crawl process is a grey area. Two different types of crawl happen on a website.
  • Fresh Crawl: Takes at most 6 days. Very few URLs are crawled at a time.
  • Deep Crawl: An extensive crawl that might take up to a month to complete.

Parts of Google

Similarly, Google’s search process has three main parts:
  • Googlebot
  • The Indexer
  • The Query Processor

Googlebot

Googlebot, Google’s web crawler, is a program which crawls the web, pulls pages from it, and hands them to the Indexer. In detail, its function is to find a page, download or cache the entire web page, and send it to the Indexer. To avoid overloading servers with its requests, Googlebot deliberately crawls a site more slowly than it is capable of doing. Have a look at the Google Guide.
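One common way for a crawler to go easy on servers is to enforce a minimum gap between requests to the same host. The sketch below shows that idea only. How Google actually paces its crawl is not public, and `PoliteScheduler`, `min_delay`, and the injectable `clock` and `sleep` parameters are names invented for this example.

```python
import time
from urllib.parse import urlparse


class PoliteScheduler:
    """Enforces a minimum gap between requests to the same host."""

    def __init__(self, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay  # seconds between hits on one host (assumed value)
        self.clock = clock
        self.sleep = sleep
        self.last_request = {}      # host -> time of the last request

    def wait_for(self, url):
        """Block until it is polite to request url; return seconds waited."""
        host = urlparse(url).netloc
        now = self.clock()
        last = self.last_request.get(host)
        waited = 0.0
        if last is not None:
            remaining = self.min_delay - (now - last)
            if remaining > 0:
                self.sleep(remaining)
                waited = remaining
        self.last_request[host] = self.clock()
        return waited
```

A crawler would call `wait_for(url)` just before each download. Requests to different hosts are not delayed by one another, so the overall crawl stays fast while each individual server sees only a slow trickle of requests.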

Google’s Indexer

The indexer works basically like a dictionary or directory. The pages retrieved by Googlebot are stored in the index, sorted alphabetically. The indexer ignores common words such as ‘of’, ‘on’, ‘how’, and ‘why’, as well as punctuation marks and multiple spaces. Read more at Google Truths.
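A toy version of such an index can be written as follows. It is a minimal sketch that assumes the index maps each word to the pages containing it. The stop-word list holds only the four words mentioned above, and `tokenize`, `build_index`, and the sample pages are hypothetical.

```python
import re

# Stop words named in the article; a real list would be longer (assumption).
STOP_WORDS = {"of", "on", "how", "why"}


def tokenize(text):
    """Lowercase, drop punctuation and extra spaces, remove stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_index(pages):
    """Map each word to the sorted list of URLs whose text contains it."""
    index = {}
    for url, text in pages.items():
        for word in set(tokenize(text)):
            index.setdefault(word, []).append(url)
    # Sort the words alphabetically, like a dictionary.
    return {word: sorted(urls) for word, urls in sorted(index.items())}


# Hypothetical sample pages for illustration.
pages = {
    "http://example.com/a": "How   crawling works, on the web!",
    "http://example.com/b": "Why crawling of pages matters.",
}
index = build_index(pages)
```

In this sketch, `index["crawling"]` lists both pages, while ‘how’, ‘on’, ‘why’, and ‘of’ never appear as keys. The extra spaces and the punctuation in the sample text are dropped during tokenizing.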

Google’s Query Processor

The query processor’s main function is to take the query entered by the user and match it against the documents already recorded in the index. The whole process is carried out in accordance with Google’s PageRank algorithm. You can find out more about PageRank in previous articles.
Interestingly, a single query typed into the search box is handled by 1000 machines in 0.2 seconds, according to Amit Agarwal of Labnol.
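The matching step can be illustrated with a small lookup function. This is a simplified sketch, not Google's query processor: the `search` function, the sample `index`, and the `scores` dictionary are all made up here, and the scores merely stand in for PageRank values rather than being computed by the real algorithm.

```python
def search(query, index, scores):
    """Return URLs containing every query word, best score first."""
    words = query.lower().split()
    if not words:
        return []
    matches = None
    for word in words:
        urls = set(index.get(word, []))
        matches = urls if matches is None else matches & urls
    # Ties are broken by URL so the ordering is deterministic.
    return sorted(matches, key=lambda url: (-scores.get(url, 0.0), url))


# Hypothetical index and scores; the scores stand in for PageRank values.
index = {
    "crawling": ["http://example.com/a", "http://example.com/b"],
    "pages": ["http://example.com/b"],
    "web": ["http://example.com/a"],
}
scores = {"http://example.com/a": 0.3, "http://example.com/b": 0.7}

results = search("crawling", index, scores)
```

Here a query for "crawling" matches both pages and lists the higher-scored one first, while "crawling web" narrows the result to the single page that contains both words.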

This was all about Google’s Crawling Process De-Mystified.
MohitChar