Why Your Robots.txt-Blocked URLs May Show Up in Google
Matt Cutts has appeared in yet another Google Webmaster Video, this time with a whiteboard to illustrate his point. The topic is uncrawled URLs that appear in search results.
Cutts says Google gets a lot of complaints from webmasters who say the search engine is violating their robots.txt files, which they use to keep Google from crawling certain pages, because those URLs sometimes still end up in search results.
According to Matt, in most cases where a webmaster complains that a URL blocked in robots.txt, such as example.com/go, still appears in Google, the search result turns out to be just the URL with no text snippet. The reason is that Google didn’t actually crawl the page.
"It did abide by robots.txt. You told us this page is blocked, so we did not fetch this page," says Matt. What appears in the results is only a reference to the URL. "We saw a link to it, but we didn’t fetch the page itself," he explains.
Because the page itself was never fetched, there is no text to build a snippet from. As for why Google shows such URLs at all, Cutts gives the example of the California DMV, whose site is www.dmv.ca.gov.
Cutts notes that at one point the California Department of Motor Vehicles had a robots.txt that blocked all search engines. "Now these days pretty much every site is savvy enough, you know, at one point the New York Times and eBay and a whole bunch of different sites would use robots.txt," he says.
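To make the two kinds of blocking concrete, here is a minimal sketch of how a well-behaved crawler might evaluate a robots.txt file, using Python's standard urllib.robotparser module. The file contents below are illustrative assumptions (not the actual robots.txt of the DMV or any other site), and may_fetch is a hypothetical helper name. This is not how Google's crawler is implemented.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt contents (assumptions, not any site's real file).
# This one blocks every crawler from the entire site.
BLOCK_EVERYTHING = """\
User-agent: *
Disallow: /
"""

# This one blocks every crawler from paths that start with /go only.
BLOCK_ONE_PATH = """\
User-agent: *
Disallow: /go
"""


def may_fetch(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


site_wide = may_fetch(BLOCK_EVERYTHING, "Googlebot", "http://www.dmv.ca.gov/")
blocked_path = may_fetch(BLOCK_ONE_PATH, "Googlebot", "http://example.com/go")
other_path = may_fetch(BLOCK_ONE_PATH, "Googlebot", "http://example.com/about")

print(site_wide, blocked_path, other_path)  # False False True
```

Note that the check only decides whether a page may be fetched. It says nothing about whether the URL can be listed, which is the distinction Cutts is drawing.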
If someone searches for "California DMV" in Google, there’s pretty much only one answer, he says, so that is the answer Google wants to return. Luckily for Google, a lot of people were linking to that page with the anchor text "California DMV". That lets Google return the result without having to crawl the page.
Cutts also says that Google can get descriptions from a directory like the Open Directory Project (DMOZ). He cites Nissan and Metallica.com as examples of sites that used to block Google with robots.txt. Both were listed in the Open Directory Project, however, so Google took the description from there and used it as the snippet.
When this type of thing happens, it looks like the page was crawled, when in fact it wasn’t. "So we are able to return something that can be very helpful to users without violating robots.txt by not crawling that page," says Cutts.
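He also notes that when you don’t want pages to show up at all, you can use the "noindex" meta tag in the page’s head section. When Google sees this tag, it drops the page from its search results completely. Google has to be able to fetch the page to see the tag, so the page must not also be blocked in robots.txt. Another option is the URL removal tool.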
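As a rough illustration of what the "noindex" tag looks like and how a crawler could detect it in a page it has fetched, here is a minimal sketch using Python's standard html.parser module. The class name RobotsMetaParser, the helper has_noindex, and the sample pages are all hypothetical, and the logic is a simplification rather than Google's actual behavior.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records whether a page carries a robots meta tag containing noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() not in ("robots", "googlebot"):
            return
        # The content attribute is a comma-separated list of directives.
        directives = [d.strip().lower() for d in attr.get("content", "").split(",")]
        if "noindex" in directives:
            self.noindex = True


def has_noindex(html):
    """Return True if the HTML text contains a noindex robots meta tag."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex


# Hypothetical sample pages.
PRIVATE_PAGE = (
    '<html><head><meta name="robots" content="noindex">'
    "<title>Private</title></head><body>Hello</body></html>"
)
PUBLIC_PAGE = "<html><head><title>Public</title></head><body>Hello</body></html>"

print(has_noindex(PRIVATE_PAGE), has_noindex(PUBLIC_PAGE))  # True False
```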