Google: Not Having Robots.txt is “A Little Bit Risky”
By: Chris Crum - August 24, 2011
Robots.txt, as you may know, lets Googlebot know whether or not you want it to crawl your site.
Google’s Matt Cutts spoke about a few options for these files in the latest Webmaster Help video, in response to a user-submitted question: “Is it better to have a blank robots.txt file, a robots.txt that contains ‘User-agent: * Disallow:’, or no robots.txt file at all?”
“I would say any of the first two,” Cutts responded. “Not having a robots.txt file is a little bit risky – not very risky at all, but a little bit risky because sometimes when you don’t have a file, your web host will fill in the 404 page, and that could have various weird behaviors. Luckily we are able to detect that really, really well, so even that is only like a 1% kind of risk.”
“But if possible, I would have a robots.txt file, whether it’s blank or you say User-agent: * and Disallow: nothing, which means everybody’s able to crawl anything they want – it’s pretty equal,” said Cutts. “We’ll treat those syntactically as being exactly the same. For me, I’m a little more comfortable with User-agent: * and then Disallow: just so you’re being very specific that ‘yes, you’re allowed to crawl everything’. If it’s blank then yes, people were smart enough to make the robots.txt file, but it would be great to have just like that indicator that says exactly, ‘ok, here’s what the behavior is that’s spelled out.’ Otherwise, it could be like maybe somebody deleted everything in the file by accident.”
“If you don’t have one at all, there’s just that little tiny bit of risk that your web host might do something strange or unusual like return a ‘you don’t have permission to read this’ file, which you know, things get a little strange at that point,” Cutts reiterated.
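For reference, the allow-everything file Cutts recommends is just two lines of standard robots.txt syntax, placed at the root of your site (e.g. example.com/robots.txt):

```
# Apply to all crawlers
User-agent: *
# Disallow nothing, i.e. allow everything to be crawled
Disallow:
```

An empty Disallow: directive means no paths are blocked, which, as Cutts notes, Google treats the same as a blank file – the explicit version just makes the intent unmistakable.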
All of this, of course, assumes that you want Google to crawl your site.
In another video from Cutts we looked at yesterday, he noted that Google will sometimes use DMOZ to fill in snippets in search results when they can’t otherwise see the page’s content because it was blocked by robots.txt. He noted that Google is currently looking at whether or not it wants to continue doing this.