Crawl budget refers to the maximum number of pages that a search engine can and wants to crawl on a given website. Google determines the amount of time it spends crawling your site (crawl budget) by taking into account how much crawling your site can handle (crawl rate limit) and how much demand there is for your site's content (crawl demand).
- Crawl rate limit: The speed of your pages, crawl errors, and the crawl limit set in Google Search Console (website owners have the option of reducing Googlebot’s crawl of their site) can all impact your crawl rate limit.
- Crawl demand: The popularity of your pages as well as how fresh or stale they are can impact your crawl demand.
How Does Crawl Budget Work?
Most of the information we have on how crawl budget works comes from a post by Google's Gary Illyes. In it, Illyes emphasized that:
- Crawl budget should not be something most publishers have to worry about.
- If a site has fewer than a few thousand URLs, most of the time it will be crawled efficiently.
The following are key concepts regarding crawl budget that will aid in understanding it better.
Crawl Rate Limit
Google is aware that its search bot can have a negative impact on websites if it is not careful, so it has put measures in place to ensure that its crawlers only visit websites as often as is sustainable for that site.
Google uses crawl rate limit to help it determine how much time to spend crawling a website.
Here’s how it works:
- Googlebot will crawl a website.
- The bot will gradually increase the load on the site’s server and see how it responds.
- Googlebot will then lower or raise the limit.
To change the limit on your website, go to the Crawl Rate Settings page in Google Search Console.
Crawl Demand
Googlebot also considers the demand for a particular URL from the index when deciding how actively it should crawl.
The two factors that play a significant role in determining crawl demand are:
- URL Popularity: Popular pages will get indexed more frequently than ones that aren’t.
- Staleness: Google’s systems try to keep URLs from going stale in the index by favoring fresh, up-to-date content.
Google uses factors like the crawl rate limit and crawl demand to determine the number of URLs Googlebot can and wants to crawl (crawl budget).
Optimizing Crawl Budget
1. Preventing Google from Crawling Your Non-Canonical URLs
If you are not familiar with canonical tags, they are tags that tell Google which version of a page is the preferred, primary version.
Say, for example, you have a product category page for “women’s jeans” located at /clothing/women/jeans, which allows visitors to sort by price: low to high (i.e. faceted navigation).
This might change the URL to /clothing/women/jeans?sortBy=PriceLow. Sorting changes the order of the jeans but not the content of the page, so you wouldn’t want both versions to be indexed.
The canonical tag would be added to the /clothing/women/jeans?sortBy=PriceLow page, pointing to /clothing/women/jeans as the primary version and marking the sorted URL as a duplicate. The same approach applies to URL parameters that are added as session identifiers.
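For this example, the canonical tag on the sorted page might look like the snippet below (example.com is a placeholder domain):

```html
<!-- Placed in the <head> of /clothing/women/jeans?sortBy=PriceLow -->
<!-- Tells Google the unsorted category page is the preferred version -->
<link rel="canonical" href="https://www.example.com/clothing/women/jeans" />
```

The tag goes on the duplicate (sorted) page and points at the primary page, not the other way around.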
Botify’s non-indexable indicator makes it easy to see when Google is spending time crawling pages that aren’t important.
Take an eCommerce client as an example: such a site may have a large number of crawlable URLs that are not the main, canonical versions. In one Botify analysis, 97% of the one million pages crawled were non-canonical.
Of the site’s roughly 25,000 indexable URLs, Google managed to crawl only a little over half in the span of a month.
Even though Google’s crawl budget would have covered more than the total number of indexable URLs, the rest of the budget was spent on URLs that could not be indexed. If Google weren’t crawling those non-canonical URLs, it could crawl the indexable pages more often, and pages that get crawled more often tend to get more visits.
Google has said for years that this is a waste of their resources, but the problem persists.
If you have pages on your website that use up a lot of server resources, it will take away from the resources that are being used for pages that are actually important. This could prevent Google from finding your good content.
You can use your site’s robots.txt file to let search engine bots know which pages to crawl and which to ignore. If you’re unfamiliar, robots.txt files live at the root of websites and look like this:
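Here is a minimal illustrative example (the paths and domain are placeholders):

```
# Apply these rules to all crawlers
User-agent: *
# Keep bots out of internal search results and sorted/faceted URLs
Disallow: /search/
Disallow: /*?sortBy=

Sitemap: https://www.example.com/sitemap.xml
```

Directives like these tell Googlebot to skip low-value parameterized URLs so its time goes to pages you actually want indexed.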
To learn more about how to create robots.txt files, visit Google’s documentation.
2. Improving Page Speed
Page speed is important for both user experience and search rankings, but it also affects crawl budget: the faster your pages respond, the more pages a search engine spider can crawl in the time it spends on your site.
3. Minimizing Crawl Errors and Non-200 Status Codes
One of the factors that Google takes into consideration when deciding how much time to spend crawling your site is whether or not the crawler is encountering errors.
If the Googlebot encounters a lot of errors while crawling your site, such as 500 server errors, your crawl rate limit could be lowered, which would result in a reduced crawl budget.
If you notice a lot of 5xx errors, you should check your server’s capabilities.
But non-200 status codes can also simply constitute waste. Why bother having Google crawl pages you’ve deleted or redirected when you could just direct them to your live, current URLs?
Avoid making Googlebot pass through chains of redirects to reach your content, as every extra hop uses up crawl budget. Instead, link directly to the final destination URL.
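The waste from a redirect chain can be sketched offline. This toy script uses hypothetical URLs and an in-memory redirect map rather than real HTTP requests; it follows each redirect and counts the extra hops a crawler would spend:

```python
# Hypothetical redirect map: old URL -> where it redirects to.
redirects = {
    "/old-jeans": "/clothing/jeans",             # first site migration
    "/clothing/jeans": "/clothing/women/jeans",  # second site migration
}

def resolve(url, redirects, max_hops=10):
    """Follow redirects to the final destination, counting the hops."""
    hops = 0
    while url in redirects and hops < max_hops:
        url = redirects[url]
        hops += 1
    return url, hops

final_url, hops = resolve("/old-jeans", redirects)
print(final_url, hops)  # /clothing/women/jeans 2 -- two wasted requests
```

Any internal link pointing at /old-jeans costs Googlebot two extra requests; updating the link to the final URL costs none.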
Also, avoid common XML sitemap mistakes such as:
- Listing non-indexable pages like non-200s, non-canonicals, non-HTML, and no-indexed URLs.
- Forgetting to update your sitemap after URLs change during a site migration
- Omitting important pages, and more.
Make sure to include only live URLs that you want search engines to crawl and index; leaving out key pages can be detrimental. Have old product pages? Expire them and remove them from your sitemap when they are no longer relevant.
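A minimal sitemap entry that meets these criteria might look like this (the domain and date are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Only live, canonical, indexable URLs belong here -->
  <url>
    <loc>https://www.example.com/clothing/women/jeans</loc>
    <lastmod>2023-01-15</lastmod>
  </url>
</urlset>
```

Every URL listed should return a 200 status and be the canonical version of its page.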
4. Checking Your Crawl Rate Limit in Google Search Console
You can control how often Googlebot crawls your site by changing the crawl rate in Google Search Console. Because the crawl rate is part of how Google determines how much time to spend on your site during each visit, it’s an important setting to understand.
If the crawl rate is too high, Googlebot’s crawl could put unnecessary strain on your server, so webmasters have the option to set a limited crawl rate to prevent this. Be aware, though, that limiting the rate means Google could find less of your important content.
To change your crawl rate, go to the crawl rate settings page for the property you want to change. There are two options you will see: “Let Google optimize” and “Limit Google’s maximum crawl rate.”
To increase your crawl rate, check if “Limit Google’s maximum crawl rate” has been selected by mistake.
5. Increasing the Popularity of Your Pages
Crawling is the process where Google goes through the internet and looks for new content. More popular URLs tend to be crawled more often.
One way that Google may determine the popularity or importance of a page is by its depth. The “click depth” refers to the number of clicks it takes to get to a certain page on a website from the home page.
Another signal of popularity on your site is internal linking. If a page is linked to by many other pages, it is likely that the page is popular.
It’s a good idea to keep your important pages close to the home page and link to them often so Google can better understand how popular and important they are.
It’s not possible to include links to every single page on your website from the home page, so be thoughtful and deliberate about which pages you do include, as well as the overall organization of your website.
If your website has pages that are rarely visited or linked to, Google is likely to see them as less popular and crawl them less often.
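Click depth is something you can measure yourself. The sketch below uses a made-up internal link graph (each page maps to the pages it links to) and computes every page's depth from the home page with a breadth-first search:

```python
from collections import deque

# Hypothetical internal link graph: page -> pages it links to.
links = {
    "/": ["/clothing", "/blog"],
    "/clothing": ["/clothing/women/jeans"],
    "/blog": [],
    "/clothing/women/jeans": [],
    "/orphan-page": [],  # never linked from any reachable page
}

def click_depths(links, home="/"):
    """Return each reachable page's click depth from the home page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

depths = click_depths(links)
# /clothing/women/jeans sits two clicks from home; /orphan-page never
# appears in the result at all, flagging it as unreachable.
```

Pages missing from the result (like the orphan page above) are exactly the ones Google is likely to see as unimportant or fail to discover.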
6. Refreshing Stale Content
If your pages have not been updated recently, Google may crawl them less often, since Google tries to avoid serving old, outdated pages in its search results.
You can tell if you have any out-of-date content on your website by looking at posts that were published before a certain date.
To view posts more than three months old, set the date-range filter to three months or more. If a site publishes infrequently, you might instead look at posts more than three years old; it depends on your publishing cadence.
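If you have publish dates in a structured form (an export, a CMS database), flagging stale posts takes only a few lines of scripting. This sketch uses made-up URLs and dates:

```python
from datetime import date, timedelta

# Hypothetical posts and their publish dates.
posts = {
    "/blog/crawl-budget-guide": date(2021, 1, 10),
    "/blog/fresh-post": date.today() - timedelta(days=30),
}

def stale_posts(posts, max_age_days=90):
    """Return URLs whose publish date is older than the cutoff."""
    cutoff = date.today() - timedelta(days=max_age_days)
    return [url for url, published in posts.items() if published < cutoff]

print(stale_posts(posts))  # only the 2021 post is flagged for a refresh
```

Adjust max_age_days to match your own cadence, per the advice above.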
Tracking Crawl Budget
It can be difficult to figure out your current crawl budget, as the new Search Console hides most of the legacy reports, and the idea of server logs may sound extremely technical. In practice, though, monitoring your crawl budget can be done in two ways: through Google Search Console, or through server log analysis with a tool like Screaming Frog’s SEO Log File Analyser.
Google Search Console
To find your website’s crawl stats, go to Search Console > Legacy Tools and Reports > Crawl Stats.
Access the Crawl Stats report to see what Googlebot has been up to over the past 90 days. (Can you see any patterns?)
Server Logs
Server logs store every request made to your web server. An entry is added to the access log file every time a user or bot visits your site.
Because Googlebot (Google’s web crawler) leaves an entry in the access log each time it visits, you can use this file to determine how often Googlebot crawls your website. You can analyze the file either manually or automatically.
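As a sketch of what automated analysis looks like, the snippet below counts Googlebot requests per URL in a few made-up combined-format log lines. Note that a real pipeline should also verify the requester's IP address against Google's published ranges, since user-agent strings can be spoofed:

```python
import re
from collections import Counter

# Made-up access log lines in the common combined format.
log_lines = [
    '66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] "GET /clothing/women/jeans HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [10/Oct/2023:13:55:40 +0000] "GET /old-page HTTP/1.1" 404 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [10/Oct/2023:13:56:02 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]

# Pull the requested path out of the "GET /path HTTP/1.1" portion.
request_re = re.compile(r'"GET (\S+) HTTP')

def googlebot_hits(lines):
    """Count requests per URL made by clients identifying as Googlebot."""
    hits = Counter()
    for line in lines:
        if "Googlebot" in line:
            match = request_re.search(line)
            if match:
                hits[match.group(1)] += 1
    return hits

print(googlebot_hits(log_lines))  # which URLs Googlebot requested, how often
```

Extending this to read a real access log file and bucket hits by day or by status code gives you most of what a commercial log analyzer reports.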
There are commercial log analyzers that can help you understand what Google bot is doing on your website. These tools can provide you with relevant information that can help you improve your site.
Server log analysis reports will show:
- How frequently your site is being crawled.
- Which pages Googlebot is accessing the most.
- What type of errors the bot has encountered.
The most popular log analyzer tools are as follows:
- SEMrush Log File Analyzer
- SEO Log File Analyser by Screaming Frog
- OnCrawl Log Analyzer
- Botlogs by Ryte
Factors Affecting Crawl Budget
If your website has a lot of unimportant pages, it can hurt how easily search engines discover and crawl your site. Things that reduce a site’s crawling potential (how much of the site a search engine is willing to crawl) include infinite scrolling, duplicate content, and spam.
Server and Hosting Setup
Google considers the stability of each website. If a website continually crashes, Googlebot will stop crawling it.
Faceted Navigation and Session Identifiers
If your website generates a lot of constantly changing URLs through faceted navigation or session identifiers, it can bloat your URL structure and make the site harder to crawl. If you’re having issues with Google indexing your pages, this duplication is a likely cause, since duplicate URLs provide no benefit to Google’s users.
Low Quality Content and Spam
If a lot of the content on your website is low quality or spam, the crawler will lower your budget.
Rendering and Network Requests
If rendering a page requires many network requests, it can decrease the number of pages Google is able to crawl. (Not sure what rendering is? It is the process of inserting data from APIs and/or databases into webpages.) Providing an XML sitemap can also help Google more easily find and understand the content and structure of your site.
Crawl budget is the number of pages that a search engine will crawl on a website. It is most relevant for large websites, where optimizing it helps search engines reach the pages that matter. The term “crawl budget” might change or disappear in the near future, as Google is constantly evolving and testing new solutions for users.
It is important to remember the basic principles and to concentrate on activities that benefit the people who use your site.