Disabling Google and Other Search Engines From Crawling a Site

Reader question: I have an online database of horror movies, and I have a good Google rank. In my traffic logs over the last month I noticed a sharp increase in bandwidth use. One of the most frequent visitors in the server logs is Googlebot, so this traffic was generated by Google's spidering engine. I have a 20 GB bandwidth limit and I don't want to pay for excess, so I have blocked Google from my Web site. My question is:

If I block Google from my Web site, is it possible that Google.com will erase or drop my Web site from its directory?

Many thanks for your time and keep up the good work.

Answer: Many thanks for posting this question because Web server issues and excluding robots are a very important aspect of search engine marketing (SEM). The reader did not specifically state how he kept Googlebot from spidering his site. I am assuming that the reader used the Robots Exclusion Protocol.

Robots Exclusion Protocol

The Robots Exclusion Protocol is a means of instructing robots (or spiders) not to crawl a site. With the Robots Exclusion Protocol, Web site owners can instruct search engine spiders not to index individual Web pages, subdirectories, or even an entire site. Instructions can also be tailored for individual search engines.

There are two types of robots exclusion: a meta tag and a text file.

To let Google know that you do not want a page indexed or its links followed, you can place the following meta tag in the head section of that page:

<META NAME="GOOGLEBOT" CONTENT="NOINDEX, NOFOLLOW">

To let all search engine spiders know that you do not want a page indexed or its links followed, you can place the following meta tag in the head section of that page:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

For this tag to be effective on a whole site, you will have to place it on every page of your site. This process can be quite boring and time consuming. A spider also has to request a page before it can read the tag, so the meta tag does not reduce the bandwidth that crawling uses. For those reasons, I prefer to use the robots exclusion text file, commonly referred to as robots.txt, because it can easily be applied to an entire site.
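Because the meta tag has to appear on every page, it can help to check pages automatically rather than by eye. The following is a minimal sketch, assuming Python and its standard html.parser module; the class name RobotsMetaParser and the function robots_meta_directives are hypothetical names chosen for illustration, not part of any search engine's tooling.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases tag and attribute names; values keep their case.
        attr_map = {key: (value or "") for key, value in attrs}
        name = attr_map.get("name", "").lower()
        if name in ("robots", "googlebot"):
            # Turn "NOINDEX, NOFOLLOW" into {"noindex", "nofollow"}.
            tokens = {
                token.strip().lower()
                for token in attr_map.get("content", "").split(",")
                if token.strip()
            }
            self.directives.setdefault(name, set()).update(tokens)


def robots_meta_directives(html_text):
    """Return a dict such as {"robots": {"noindex", "nofollow"}} for one page."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.directives
```

Running robots_meta_directives over the HTML of each page and looking for an empty result is one way to spot pages where the tag was forgotten.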

The robots.txt file is a plain text file that you place in the root directory of your server. It instructs search engine spiders NOT to request pages in specified areas of your Web site. In other words, the text file lets the search engine spiders know which sections of your site are off limits.

I usually create my robots.txt files in Notepad (PC) or SimpleText (Mac). You can also create simple text files in HTML software such as Dreamweaver.

Google will request the robots.txt file before trying to crawl any page within your site. For example, if you do not want Google to record any of the information on the site, type the following text into a text editor:

User-agent: Googlebot
Disallow: /

Be sure to save the file as robots.txt. Do not use any other file extension. If you save the file as a Word document and call it robots.doc, Google will ignore that file.
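Before uploading the file, it can be worth confirming that the rules do what you expect. The following is a minimal sketch, assuming Python and its standard urllib.robotparser module; the helper name allowed, the user agent "SomeOtherBot" and the example.com URL are hypothetical and only serve the illustration.

```python
from urllib.robotparser import RobotFileParser

# The same two lines shown above, held in a string for testing.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Googlebot is shut out of the whole site; a spider not named in the file is not.
print(allowed(ROBOTS_TXT, "Googlebot", "http://www.example.com/movies/index.html"))
print(allowed(ROBOTS_TXT, "SomeOtherBot", "http://www.example.com/movies/index.html"))
```

This only checks how Python's parser reads the rules. Each search engine uses its own parser, so treat the result as a sanity check rather than a guarantee.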

When search engines crawl too frequently

I understand the reader's concern about bandwidth. If Google or any search engine crawls a site too frequently, it takes up bandwidth. All of us pay for bandwidth.

However, when you instruct Google (or any search engine) to not crawl your site, you are essentially communicating, "Don't show my Web pages in your search results."

I do not believe the reader's intention was to exclude all of his Web pages from Google search engine results pages (SERPs). He just wants Google not to request pages from his server so often.

Google actually has a Web page with this information and an email address. This is a direct quote from Google's Webmaster FAQs page:

"Please send an email to googlebot@google.com with the name of your site and a detailed description of the problem. Please also include a portion of the weblog that shows Google accesses, so we can track down the problem more quickly on our end."

The URL for this page is http://www.google.com/webmasters/faq.html.
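To gather the portion of the log that Google asks for, you can filter your access log for Googlebot requests and add up the bytes served. The following is a minimal sketch, assuming Python and a server that writes the common "combined" log format. The function name googlebot_usage and the sample log lines are invented for illustration, and your own log layout may differ.

```python
import re

# Assumed layout: host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def googlebot_usage(lines):
    """Return (hits, total_bytes, matching_lines) for requests whose user agent names Googlebot."""
    hits = 0
    total_bytes = 0
    matching = []
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or "googlebot" not in match.group("agent").lower():
            continue
        hits += 1
        size = match.group("size")
        # A "-" in the size field means no body was sent (for example, a 304 response).
        total_bytes += 0 if size == "-" else int(size)
        matching.append(line.rstrip("\n"))
    return hits, total_bytes, matching


# Invented sample lines; the addresses come from a range reserved for documentation.
SAMPLE_LOG = [
    '192.0.2.10 - - [10/Oct/2003:13:55:36 -0700] "GET /movies/1.html HTTP/1.0" 200 2326 "-" "Googlebot/2.1"',
    '192.0.2.20 - - [10/Oct/2003:13:56:01 -0700] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/4.0"',
    '192.0.2.10 - - [10/Oct/2003:13:57:12 -0700] "GET /movies/2.html HTTP/1.0" 304 - "-" "Googlebot/2.1"',
]
hits, total_bytes, matching = googlebot_usage(SAMPLE_LOG)
print(hits, total_bytes)
```

The matching lines are the excerpt to paste into the email, and the byte total gives a rough idea of how much of the monthly bandwidth the spider accounts for. The user agent string can be forged, so the numbers are an estimate.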

When to use the Robots Exclusion Protocol

Some content is not important to site visitors and search engines, such as items in a CGI-BIN directory. When your target audience searches for information, they are not interested in your site's programs that generate your forms or your drop-down menus. They are not interested in a section of a Web site that is under construction. They are not interested in redundant content, either. Using the Robots Exclusion Protocol ensures that unnecessary information is not shown in search results pages.
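For this kind of selective exclusion, the robots.txt file lists only the directories that should stay out of search results and leaves the rest of the site open. The following is a minimal sketch, assuming Python and its standard urllib.robotparser module; the directory name /under-construction/, the file paths and the helper name blocked_paths are hypothetical examples.

```python
from urllib.robotparser import RobotFileParser

# Rules for all spiders ("*"): keep them out of two directories only.
PARTIAL_ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /under-construction/
"""


def blocked_paths(robots_txt, user_agent, paths):
    """Return the subset of paths that the rules forbid user_agent to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in paths if not parser.can_fetch(user_agent, path)]


SITE_PATHS = [
    "/index.html",
    "/cgi-bin/form.pl",
    "/under-construction/new.html",
    "/movies/list.html",
]
print(blocked_paths(PARTIAL_ROBOTS_TXT, "Googlebot", SITE_PATHS))
```

With rules like these, the pages your target audience is looking for remain available to spiders, while scripts and unfinished sections are left alone.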

For more details about the Robots Exclusion Protocol, please visit: http://www.robotstxt.org/wc/faq.html.

Shari Thurow is Marketing Director at Grantastic Designs, Inc., a full-service search engine marketing, web and graphic design firm. This article is excerpted from her book, Search Engine Visibility (http://www.searchenginesbook.com) published in January 2003 by New Riders Publishing Co. Shari can be reached at shari@grantasticdesigns.com.

Comments

Great topic

I was a bit confused on the topic (I am still confused a bit). Why on earth would a webmaster choose not to have his pages crawled when Google is one of the best sources of traffic? I read somewhere that if you have many forms of the same content (such as an HTML form and a print form) then you apply it, but applying robots.txt to the whole site is absurd, I think.


What legal right does Google have to crawl my website?

Hi,

While most discussion on the web is about how to get a website IN to Google, I recently had a client who did NOT want to be listed in Google (for whatever reason!).

So, he asked me, "Does Google have the legal right to crawl my website and list information in its search engine?"

My response was to tell him that once a website is created online it becomes part of the public domain and therefore search engines are entitled to (and do) visit the site with their crawlers, spiders, robots, etc.

Is this the correct legal answer?

Tony
