Google has released the latest installment of its educational video series How Search Works, explaining how its search engine discovers and accesses web pages through crawling.
Google Analyst Details Crawl Process
In a seven-minute episode hosted by Google analyst Gary Illyes, the company details the technical aspects of how Googlebot (the software Google uses to crawl the web) works.
Illyes outlines the steps Googlebot takes to find new and updated content from the internet’s trillions of web pages and make it searchable on Google.
Illyes explains:
“Most new URLs that Google discovers come from other known pages that Google has previously crawled.
Consider a news site with various category pages that link to individual news articles.
Google can find most published articles by visiting the category page from time to time and extracting the URLs that lead to the articles.”
How Googlebot crawls the web
Googlebot discovers new URLs by following links from web pages it already knows about, a process called URL discovery.
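To make the idea concrete, here is a minimal, illustrative sketch of URL discovery in Python. It is not Googlebot's actual implementation; the seed URL and the LinkExtractor class are hypothetical placeholders.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags on a fetched page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical seed: a category page the crawler already knows about.
seed_url = "https://example.com/news/"

with urlopen(seed_url) as response:
    html = response.read().decode("utf-8", errors="replace")

parser = LinkExtractor()
parser.feed(html)

# Resolve relative links against the seed so each one is a full URL
# that could be queued for crawling.
discovered = {urljoin(seed_url, link) for link in parser.links}
print(f"Discovered {len(discovered)} candidate URLs from {seed_url}")
```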
To avoid overloading sites, Googlebot crawls each one at a customized speed based on server response time and content quality.
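The video doesn't spell out how Google tunes crawl rate, but the general pacing idea can be sketched as follows: slow responses widen the delay between requests, fast responses let it shrink again. The politely_fetch function and its thresholds are purely illustrative assumptions, not Google's algorithm.

```python
import time
from urllib.request import urlopen

def politely_fetch(urls, min_delay=1.0, backoff_factor=2.0):
    """Fetch URLs one at a time, widening the delay when the server is slow.

    Illustrative pacing rule: if a response takes longer than the current
    delay, the next wait grows by `backoff_factor`; fast responses let the
    delay drift back toward `min_delay`.
    """
    delay = min_delay
    for url in urls:
        start = time.monotonic()
        with urlopen(url) as response:
            response.read()
        elapsed = time.monotonic() - start

        if elapsed > delay:
            delay = min(delay * backoff_factor, 60.0)       # server seems strained
        else:
            delay = max(delay / backoff_factor, min_delay)   # server keeps up

        time.sleep(delay)  # wait before the next request to avoid overload
```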
Googlebot uses a current version of the Chrome browser to render pages, execute JavaScript, and correctly display dynamic content loaded by scripts. It crawls only publicly available pages, not pages behind a login.
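Outside of Google, the same render-then-read pattern is commonly approximated with a headless browser. The sketch below assumes Selenium and Chrome are installed and uses a hypothetical JavaScript-heavy URL; Googlebot's own rendering pipeline is internal to Google and not shown here.

```python
from selenium import webdriver

# Run Chrome headlessly so the page's JavaScript executes before we read it.
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")

driver = webdriver.Chrome(options=options)
try:
    # Hypothetical URL whose content is injected by a script at load time.
    driver.get("https://example.com/js-rendered-article")
    rendered_html = driver.page_source  # HTML after scripts have run
    print(len(rendered_html))
finally:
    driver.quit()
```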
Improving discovery and crawlability
Illyes emphasized the usefulness of sitemaps (XML files that list a site’s URLs) to help Google discover and crawl new content.
He advised developers to let their content management systems automatically generate sitemaps.
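For sites whose CMS can't generate a sitemap automatically, a bare-bones one can be produced with Python's standard library. This is a minimal sketch with placeholder URLs, not a full implementation of the sitemap protocol (it omits optional fields such as lastmod).

```python
import xml.etree.ElementTree as ET

def build_sitemap(urls, path="sitemap.xml"):
    """Write a minimal XML sitemap listing the given URLs."""
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=ns)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    ET.ElementTree(urlset).write(path, encoding="utf-8", xml_declaration=True)

# Placeholder URLs; a real CMS plugin would pull these from its database.
build_sitemap([
    "https://example.com/news/article-1",
    "https://example.com/news/article-2",
])
```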
Optimizing technical SEO factors such as your site’s architecture, speed, and crawl directives will also improve its crawlability.
Here are some additional tactics to make your site more crawlable:
- Avoid crawl budget exhaustion – If your website is updated frequently, Googlebot’s crawl budget may be used up before new content is discovered. Careful CMS configuration and use of rel="next" / rel="prev" pagination tags can help.
- Implement proper internal linking – Linking to new content from category pages or hub pages allows Googlebot to discover new URLs. Effective internal link structure improves crawlability.
- Ensure pages load quickly – Sites that respond slowly to Googlebot requests may have their crawl rate throttled. Optimizing page performance allows Googlebot to crawl them faster.
- Eliminate soft 404 errors – Fixing soft 404s caused by CMS misconfigurations ensures that URLs are directed to valid pages and improves crawl success rates.
- Consider adjusting robots.txt – An overly strict robots.txt file can block useful pages. An SEO audit may uncover restrictions that can be safely removed; a quick way to check which URLs Googlebot is allowed to fetch is sketched after this list.
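As a quick check for the robots.txt point above, Python's standard library can replay Googlebot's view of a site's rules. The site and URLs below are hypothetical placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical site and URLs to audit against its robots.txt rules.
robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

for url in [
    "https://example.com/news/article-1",
    "https://example.com/private/report.pdf",
]:
    allowed = robots.can_fetch("Googlebot", url)
    print(f"{url}: {'crawlable' if allowed else 'blocked by robots.txt'}")
```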
Latest educational video series
The latest video comes after Google last week launched an educational “How Search Works” series that sheds light on the search and indexing process.
The newly released episode on crawling provides insight into one of the most fundamental operations of any search engine.
In the coming months, Google will produce additional episodes exploring topics such as indexing, quality assessment, and search refinement.
The series is available on Google Search Central’s YouTube channel.
FAQ
How does Google describe the crawl process?
Google’s crawl process, outlined in a recent episode of its "How Search Works" series, includes the following key steps:
- Googlebot discovers new URLs by following links from known pages it has previously crawled.
- It strategically crawls your site at a customized speed to avoid overloading your servers, taking into account response time and content quality.
- The crawler also uses the latest version of Chrome to render pages, correctly display content loaded by JavaScript, and access only publicly available pages.
- Optimizing technical SEO elements and using a sitemap will help Google crawl your new content.
How can marketers ensure their content is effectively discovered and crawled by Googlebot?
Marketers can employ the following strategies to make their content easier for Googlebot to discover and crawl:
- Implement automatic sitemap generation within your content management system (a sketch for spot-checking the resulting sitemap follows this list).
- Focus on optimizing technical SEO factors like site architecture and loading speed, and use crawl directives appropriately.
- Configure your CMS efficiently and use pagination tags so that frequent content updates don’t deplete your crawl budget.
- Create an effective internal link structure to help Googlebot discover new URLs.
- Check and optimize your website’s robots.txt file to make sure it’s not overly restrictive for Googlebot.
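As a follow-up to the sitemap point above, here is a small audit sketch that fetches a sitemap and confirms each listed URL responds with HTTP 200, so crawl budget isn't wasted on broken entries. The sitemap location is an assumption; adjust it for a real site.

```python
import xml.etree.ElementTree as ET
from urllib.error import HTTPError
from urllib.request import Request, urlopen

SITEMAP = "https://example.com/sitemap.xml"  # hypothetical sitemap location
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urlopen(SITEMAP) as response:
    tree = ET.parse(response)

for loc in tree.findall(".//sm:loc", NS):
    url = loc.text.strip()
    try:
        status = urlopen(Request(url, method="HEAD")).status
    except HTTPError as err:
        status = err.code
    print(url, status)  # anything other than 200 deserves a closer look
```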