If you've never heard of a "crawl budget," you're not alone. Most publishers don't have to worry about it, as long as their pages tend to be crawled by Googlebot the same day they're published. Likewise, a small site with fewer than a few thousand URLs is probably already being crawled in a timely fashion.
However, Google felt it necessary to explain in a blog post exactly what a crawl budget is and which factors affect how quickly Google crawls a site.
First, there is the crawl rate limit, which caps the maximum fetching rate for a given site so that crawling does not degrade the experience of users visiting it. "Simply put, this represents the number of simultaneous parallel connections Googlebot may use to crawl the site, as well as the time it has to wait between the fetches," says Gary Illyes, who is part of the webmaster team at Google.
The crawl rate goes up and down based on site speed and server errors: fast, responsive sites with no errors get crawled more. Webmasters can also manually set a limit on crawling in Search Console.
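To make the idea concrete, here is a minimal toy model of a crawl rate limit that tracks the two quantities Illyes mentions (parallel connections and the wait between fetches) and adjusts them after each response. Google has not published its actual algorithm, so the class name, the thresholds, and the doubling and halving rules below are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class CrawlRateLimit:
    """Toy model of a per-site crawl rate limit (NOT Google's actual algorithm)."""

    connections: int = 2       # simultaneous parallel connections
    delay_s: float = 1.0       # wait between fetches on each connection
    max_connections: int = 10  # hypothetical ceiling, e.g. a webmaster-set limit

    def record(self, status: int, response_s: float) -> None:
        """Adjust the limit after one fetch (thresholds are assumptions)."""
        if status >= 500 or response_s > 2.0:
            # Server error or slow response: back off.
            self.connections = max(1, self.connections - 1)
            self.delay_s = min(60.0, self.delay_s * 2)
        elif response_s < 0.5:
            # Fast, healthy response: crawl a little more.
            self.connections = min(self.max_connections, self.connections + 1)
            self.delay_s = max(0.1, self.delay_s / 2)

    def fetches_per_minute(self) -> float:
        """Rough upper bound on fetches per minute, ignoring response time."""
        return self.connections * 60.0 / self.delay_s


limit = CrawlRateLimit()
limit.record(200, 0.2)  # fast response: limit rises
limit.record(503, 0.2)  # server error: limit falls again
print(limit.connections, limit.delay_s, limit.fetches_per_minute())
```

The point of the sketch is only the direction of the adjustments: healthy, fast responses raise the ceiling, while errors and slow responses lower it.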
Second, there is crawl demand. "Even if the crawl rate limit isn't reached, if there's no demand from indexing, there will be low activity from Googlebot," says Illyes. He adds that popular sites get priority crawling and that, in general, Google wants to crawl new content. "Taking crawl rate and crawl demand together we define crawl budget as the number of URLs Googlebot can and wants to crawl."
Having many low-value-add URLs can negatively affect a site's crawling and indexing. In order of significance, low-value-add URLs fall into these categories:
- Faceted navigation and session identifiers
- On-site duplicate content
- Soft error pages
- Hacked pages
- Infinite spaces and proxies
- Low quality and spam content
"Wasting server resources on pages like these will drain crawl activity from pages that do actually have value, which may cause a significant delay in discovering great content on a site," says Illyes.
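One way to look for the first two categories on your own site is to strip session identifiers and facet parameters from a list of URLs (from a sitemap or server log, say) and see which ones collapse to the same address. The sketch below uses only Python's standard library; the parameter names in `SESSION_PARAMS` and `FACET_PARAMS` are hypothetical examples and would need to be replaced with the ones your site actually uses.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names; replace with the ones your site uses.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}
FACET_PARAMS = {"sort", "color", "size"}
IGNORED = SESSION_PARAMS | FACET_PARAMS


def canonicalize(url: str) -> str:
    """Drop session/facet parameters and sort the rest into a stable order."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED
    ]
    kept.sort()
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


def group_duplicates(urls):
    """Map each canonical URL to the crawlable variants that collapse onto it."""
    groups = defaultdict(list)
    for url in urls:
        groups[canonicalize(url)].append(url)
    return {canon: members for canon, members in groups.items() if len(members) > 1}


urls = [
    "https://example.com/shoes?color=red&sid=abc",
    "https://example.com/shoes?sort=price",
    "https://example.com/shoes",
    "https://example.com/about",
]
for canon, members in group_duplicates(urls).items():
    print(canon, "<-", len(members), "variants")
```

Large groups are candidates for a closer look, since each variant is a separate URL that a crawler could spend time fetching.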
Read their full blog post here.