<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>WebProNews &#187; Robots.txt</title>
	<atom:link href="http://www.webpronews.com/tag/robots-txt/feed" rel="self" type="application/rss+xml" />
	<link>http://www.webpronews.com</link>
	<description>Breaking News in Tech, Search, Social, &#38; Business</description>
	<lastBuildDate>Sat, 18 May 2013 22:49:00 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Google Removing Subscriber Stats Feature From Webmaster Tools</title>
		<link>http://www.webpronews.com/google-removing-subscriber-stats-feature-from-webmaster-tools-2012-04</link>
		<comments>http://www.webpronews.com/google-removing-subscriber-stats-feature-from-webmaster-tools-2012-04#comments</comments>
		<pubDate>Wed, 25 Apr 2012 14:44:26 +0000</pubDate>
		<dc:creator>Zach Walton</dc:creator>
				<category><![CDATA[Developer]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Google Webmaster Tools]]></category>
		<category><![CDATA[Robots.txt]]></category>
		<category><![CDATA[Webmaster]]></category>

		<guid isPermaLink="false">http://www.webpronews.com/?p=144990</guid>
		<description><![CDATA[Just like when Google announced the changes coming to API deprecation and some APIs even being retired, Google is now looking at the Webmaster tools it can phase out. The company decides these tools&#8217; fate by seeing &#8220;if they&#8217;re still &#8230;]]></description>
			<content:encoded><![CDATA[<p>Just like when Google announced the changes <a href="http://www.webpronews.com/google-updates-its-deprecation-policy-2012-04">coming to API deprecation</a> and some APIs even being retired, Google is now looking at the Webmaster tools <a href="http://googlewebmastercentral.blogspot.com/2012/04/webmaster-tools-spring-cleaning.html">it can phase out</a>. The company decides these tools&#8217; fate by seeing &#8220;if they&#8217;re still useful in comparison to the maintenance and support they require.&#8221; </p>
<p>The first tool to get the boot is the Subscriber stats feature, which reported &#8220;the number of subscribers to a site&#8217;s RSS or Atom feeds.&#8221; Google already offers the same data in its FeedBurner tool, so it suggests that users of the Subscriber stats feature switch to that. </p>
<p>The second removal is for the Create robots.txt tool. This allowed Web sites to generate a robots.txt file that would block a section of a Web site from being crawled by the Googlebot. The reason for removal is that it got very little use. Google says that those people who did use the feature can easily create their own since there are a multitude of other services that create robots.txt files. </p>
<p>The last feature on the chopping block is the Site performance feature that&#8217;s part of the Webmaster Tools Lab. It let Webmasters check the average load time of a site&#8217;s pages. The reason for its removal is the same as the last &#8211; low usage. If you need to check your site&#8217;s performance, Google provides the same functionality in the Site Speed feature in Google Analytics or in Google&#8217;s PageSpeed tool. </p>
<p>As you can see, these removals are more about removing redundancy than any kind of breaking changes. The retiring of these features only means that Webmasters have to switch to one of the many other options available. Chances are you&#8217;re already using one of those alternative options. If you are still using one of the above features, you have two weeks to say your goodbyes. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.webpronews.com/google-removing-subscriber-stats-feature-from-webmaster-tools-2012-04/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Google: Not Having Robots.txt is &#8220;A Little Bit Risky&#8221;</title>
		<link>http://www.webpronews.com/google-robots-matt-cutts-2011-08</link>
		<comments>http://www.webpronews.com/google-robots-matt-cutts-2011-08#comments</comments>
		<pubDate>Wed, 24 Aug 2011 19:37:43 +0000</pubDate>
		<dc:creator>Chris Crum</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Matt Cutts]]></category>
		<category><![CDATA[Robots.txt]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[Webmasters]]></category>

		<guid isPermaLink="false">http://www.webpronews.com/?p=74337</guid>
		<description><![CDATA[Robots.txt, as you may know, lets Googlebot know whether you want it to crawl your site or not. Google&#8217;s Matt Cutts spoke about a few options for these files in the latest Webmaster Help video, in response to a user-submitted &#8230;]]></description>
			<content:encoded><![CDATA[<p>Robots.txt, as you may know, lets Googlebot know whether you want it to crawl your site or not. </p>
<p>Google&#8217;s Matt Cutts spoke about a few options for these files in the latest Webmaster Help video, in response to a user-submitted question:  &#8220;Is it better to have a blank robots.txt file, a robots.txt that contains &#8216;User-agent: * Disallow:&#8217; or no robots.txt file at all?&#8221;</p>
<p>&#8220;I would say any of the first two,&#8221; Cutts responded. &#8220;Not having a robots.txt file is a little bit risky &#8211; not very risky at all, but a little bit risky because sometimes when you don&#8217;t have a file, your web host will fill in the 404 page, and that could have various weird behaviors. Luckily we are able to detect that really, really well, so even that is only like a 1% kind of risk.&#8221;</p>
<p>&#8220;But if possible, I would have a robots.txt file whether it&#8217;s blank or you say User-agent: * Disallow nothing, which means everybody&#8217;s able to crawl anything they want is pretty equal,&#8221; said Cutts. &#8220;We&#8217;ll treat those syntactically as being exactly the same. For me, I&#8217;m a little more comfortable with User-agent: * and then Disallow: just so you&#8217;re being very specific that &#8216;yes, you&#8217;re allowed to crawl everything&#8217;. If it&#8217;s blank then yes, people were smart enough to make the robots.txt file, but it would be great to have just like that indicator that says exactly, &#8216;ok, here&#8217;s what the behavior is that&#8217;s spelled out.&#8217; Otherwise, it could be like maybe somebody deleted everything in the file by accident.&#8221;</p>
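<p>To make the comparison concrete, here is a minimal sketch that uses the robots.txt parser from Python&#8217;s standard library, which is not Googlebot&#8217;s own implementation. The crawler name ExampleBot and the example.com URL are placeholders. Under that parser, a blank file and an explicit allow-all file behave the same way, while a Disallow: / rule blocks everything.</p>

```python
from urllib.robotparser import RobotFileParser


def allows(robots_txt, agent, url):
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The options from the question, plus a block-all file for contrast.
BLANK = ""
EXPLICIT_ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

# Placeholder URL and crawler name, used only for illustration.
URL = "http://www.example.com/some/page.html"

for name, text in [
    ("blank", BLANK),
    ("explicit allow-all", EXPLICIT_ALLOW_ALL),
    ("block all", BLOCK_ALL),
]:
    print(name, allows(text, "ExampleBot", URL))
```

<p>The explicit version has the advantage Cutts describes: anyone reading the file can tell the allow-all behavior is intentional rather than the result of an accidental deletion.</p>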
<p><center><iframe width="616" height="376" src="http://www.youtube.com/embed/P7GY1fE5JQQ" frameborder="0" allowfullscreen></iframe></center></p>
<p>&#8220;If you don&#8217;t have one at all, there&#8217;s just that little tiny bit of risk that your web host might do something strange or unusual like return a &#8216;you don&#8217;t have permission to read this&#8217; file, which you know, things get a little strange at that point,&#8221; Cutts reiterated. </p>
<p>All of this, of course, assumes that you want Google to crawl your site. </p>
<p>In another <a href="http://www.webpronews.com/google-gives-an-update-on-how-it-thinks-about-dmoz-2011-08">video from Cutts we looked at yesterday</a>, he noted that Google will sometimes use DMOZ to fill in snippets in search results when they can&#8217;t otherwise see the page&#8217;s content because it was blocked by robots.txt. He noted that Google is currently looking at whether or not it wants to continue doing this. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.webpronews.com/google-robots-matt-cutts-2011-08/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Perfect 10 Fails Where Google Succeeds</title>
		<link>http://www.webpronews.com/perfect-10-fails-where-google-succeeds-2011-08</link>
		<comments>http://www.webpronews.com/perfect-10-fails-where-google-succeeds-2011-08#comments</comments>
		<pubDate>Fri, 05 Aug 2011 19:05:21 +0000</pubDate>
		<dc:creator>Chris Richardson</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Google Image Search]]></category>
		<category><![CDATA[Perfect 10]]></category>
		<category><![CDATA[Robots.txt]]></category>

		<guid isPermaLink="false">http://www.webpronews.com/?p=72676</guid>
		<description><![CDATA[Perfect10.com, a site that features incredibly attractive female models in various positions of nude repose, has long been after Google because of the site&#8217;s content appearing in Google Image Searches. Their struggle has been going on for some time now. In &#8230;]]></description>
			<content:encoded><![CDATA[<p>Perfect10.com, a site that features incredibly attractive female models in various positions of nude repose, has long been after Google because of the site&#8217;s content appearing in Google Image Searches.  Their struggle has been going on for some time now.  </p>
<p>In fact, WebProNews has articles <a href="http://www.webpronews.com/perfect-10-comes-out-swinging-at-google-again-2009-12">dating back to 2005</a> discussing this very subject.  However, according to the latest appeal loss, the saga may finally be coming to an end.  According to a post <a href="http://news.cnet.com/8301-31001_3-20088391-261/porn-studio-loses-appeal-in-google-copyright-case/">over at CNET</a>, the latest attempt by Perfect 10, one that seeks to punish Google for being a search engine that works as it&#8217;s supposed to, has been denied.</p>
<p>Here&#8217;s the gist:</p>
<blockquote><p><em>The Ninth Circuit ruled that Perfect 10, a porn studio with a long history of filing copyright suits against Internet companies, rejected a request for a preliminary injunction against Google. The court said that Perfect 10 didn&#8217;t present enough evidence to prove that it would suffer irreparable harm from the photos.</em></p></blockquote>
<p>You see, Perfect 10&#8217;s content is primarily hidden behind a paywall, meaning that in order to see their index of naked women, you have to pay for it.  Unfortunately for the Perfect 10 web developer, who apparently didn&#8217;t understand how to <a href="http://en.wikipedia.org/wiki/Robots_exclusion_standard">manipulate a robots.txt file</a>, Perfect 10 images began appearing in Google&#8217;s image search results.</p>
<p>Regardless of the fact that there is an endless number of <a href="http://www.buildwebsite4u.com/building/web-crawlers.shtml">tutorials</a> and instructional sites that inform developers how to keep their paid content from appearing in free image searches, for some reason, Perfect 10 felt it was Google&#8217;s fault their paid content was going out to the world for free.</p>
<p>In fact, Perfect 10&#8217;s claim was that Google&#8217;s image search cost them something in the area of $50 million.  Disregarding the fact that, again, the blame should&#8217;ve been placed directly on the head of the Perfect 10 web developer, the company <a href="http://www.webpronews.com/perfect-10-comes-out-swinging-at-google-again-2009-12">tried</a>, <a href="http://www.webpronews.com/perfect-10-tries-again-this-time-with-msn-2007-08">and tried</a>, and <a href="http://www.webpronews.com/perfect-10-loses-again-2007-07">tried again</a> to make Google (and others) pay for their design inadequacies. </p>
<p>Each time, these attempts did little but clog up a court system that&#8217;s already bursting at the seams.</p>
<p>There was, apparently, a slight moment of victory when another judge upheld a Perfect 10 filing against Megaupload, a file-sharing site that allows others to swap files via email or direct download.  Granted, Megaupload doesn&#8217;t have the money Google does, but even the smaller victories count, right?</p>
<p>It should also be noted that when a &#8220;<a href="http://www.google.com/search?rlz=1C1AVSW_enUS443US443&#038;q=perfect+10&#038;um=1&#038;ie=UTF-8&#038;tbm=isch&#038;source=og&#038;sa=N&#038;hl=en&#038;tab=wi&#038;biw=1366&#038;bih=667#um=1&#038;hl=en&#038;safe=off&#038;rlz=1C1AVSW_enUS443US443&#038;tbm=isch&#038;sa=1&#038;q=perfect+10&#038;oq=perfect+10&#038;aq=f&#038;aqi=g10&#038;aql=&#038;gs_sm=e&#038;gs_upl=2331700l2331700l2l2331988l1l1l0l0l0l0l251l251l2-1l1l0&#038;bav=on.2,or.r_gc.r_pw.r_cp.&#038;fp=e59c7ce44a08213c&#038;biw=1366&#038;bih=667">Perfect 10</a>&#8221; search is conducted in Google Images, the amount of content originating from the site in question is negligible, even if SafeSearch is turned off.  This means that, even though the Perfect 10 web developers finally figured out how to protect their paid content, the company still wants to nail Google to the cross.  </p>
<p>A <a href="http://perfect10.com/blog.php?id=27&#038;p=&#038;search=#comments">semi-recent post on the Perfect 10 blog</a> reveals as much. The title, &#8220;Google Is Destroying The Entertainment Industry&#8221; reeks of a &#8220;give me back my money&#8221; approach, courtesy of Mel Gibson and South Park:</p>
<p><center><iframe width="616" height="492" src="http://www.youtube.com/embed/x2oN6ijCNUY" frameborder="0" allowfullscreen></iframe></center><br />
If at first you don&#8217;t succeed in making others pay your way, try, try again.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.webpronews.com/perfect-10-fails-where-google-succeeds-2011-08/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Webmasters: Googlebot Caught in Spider Trap, Ignoring Robots.txt</title>
		<link>http://www.webpronews.com/googlebot-spider-trap-2011-08</link>
		<comments>http://www.webpronews.com/googlebot-spider-trap-2011-08#comments</comments>
		<pubDate>Mon, 01 Aug 2011 13:48:45 +0000</pubDate>
		<dc:creator>Chris Crum</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Googlebot]]></category>
		<category><![CDATA[Robots.txt]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[SEO]]></category>
		<category><![CDATA[spider traps]]></category>

		<guid isPermaLink="false">http://www.webpronews.com/?p=72052</guid>
		<description><![CDATA[Sometimes webmasters set up a spider trap or crawler trap to catch spambots or other crawlers that waste their bandwidth. If some webmasters are right, Googlebot (Google&#8217;s crawler) seems to be having some issues here. In the WebmasterWorld forum, member &#8230;]]></description>
			<content:encoded><![CDATA[<p>Sometimes webmasters set up a spider trap or crawler trap to catch spambots or other crawlers that waste their bandwidth. If some webmasters are right, Googlebot (Google&#8217;s crawler) seems to be having some issues here. </p>
<p>In the WebmasterWorld forum, member Starchild started <a href="http://www.webmasterworld.com/google/4346138.htm">a thread</a> by saying, &#8220;I saw today that Googlebot got caught in a spider trap that it shouldn&#8217;t have as that dir is blocked via robots.txt. I know of at least one other person recently who this has also happened to. Why is GB ignoring robots?&#8221;</p>
<p>Another member suggested that Starchild was mistaken, as such claims have been made in the past, only to find that there were other issues at play. </p>
<p>Starchild responded, however, that it had been in place for &#8220;many months&#8221; with no changes. &#8220;Then I got a notification it was blocked (via the spidertrap notifier). Sure enough, it was. Upon double checking, Google webmaster tools reported a 403 forbidden error. IP was google. I whitelisted it, and Google webmaster tools then gave a success.&#8221;</p>
<p>Another member, nippi, said they also got hit by it four months after setting up a spider trap, which was &#8220;working fine&#8221; until now. </p>
<p>&#8220;The link to the spider trap is rel=Nofollowed, the folder is banned in robot.txt. The spider trap works by banning by ip address, not user agent so its not caused by a faker &#8211; and of course robots.txt was setup up correctly and prior, it was in place days before the spider trap was turned on, and it&#8217;s run with no problems for months,&#8221; nippi added. &#8220;My logs show, it was the real google, from a real google ip address that ignored my robots.txt, ignored rel-nofollow and basically killed my site.&#8221;</p>
<p>We&#8217;ve reached out to Google for comment, and will update this article if and when we receive a response. </p>
<p>Meanwhile, Barry Schwartz is <a href="http://www.seroundtable.com/google-admits-fault-13787.html">reporting</a> that one site lost 60% of its traffic instantly, due to a bug in Google&#8217;s algorithm. He points to a <a href="http://www.google.com/support/forum/p/Webmasters/thread?tid=11563538bda38f29&#038;hl=en">Google Webmaster Help forum thread</a> where Google&#8217;s Pierre Far said:</p>
<p><em>I reached out to a team internally and they identified an algorithm that is inadvertently negatively impacting your site and causing the traffic drop. They&#8217;re working on a fix which hopefully will be deployed soon.</em></p>
<p>Google&#8217;s Kaspar Szymanski commented on Schwartz&#8217;s post, &#8220;While we can not guarantee crawling, indexing or ranking of sites, I believe this case shows once again that our Google Help Forum is a great communication channel for webmasters.&#8221;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.webpronews.com/googlebot-spider-trap-2011-08/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Developer Shares Story of Being Threatened by Facebook for Crawling</title>
		<link>http://www.webpronews.com/developer-shares-story-of-being-threatened-by-facebook-for-crawling-2010-04</link>
		<comments>http://www.webpronews.com/developer-shares-story-of-being-threatened-by-facebook-for-crawling-2010-04#comments</comments>
		<pubDate>Tue, 06 Apr 2010 22:23:52 +0000</pubDate>
		<dc:creator>Chris Crum</dc:creator>
				<category><![CDATA[Social Media]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[law]]></category>
		<category><![CDATA[Legal]]></category>
		<category><![CDATA[Pete Warden]]></category>
		<category><![CDATA[Robots.txt]]></category>
		<category><![CDATA[social networks]]></category>
		<category><![CDATA[web development]]></category>

		<guid isPermaLink="false">http://www.webpronews.com/?p=53564</guid>
		<description><![CDATA[<p>Pete Warden, a former software engineer at Apple, who is now working on his own start-up, <a href="http://petewarden.typepad.com/searchbrowser/2010/04/how-i-got-sued-by-facebook.html">posted an interesting story</a> about how Facebook threatened to sue him for crawling the social network. I reached out to both Warden and Facebook for more details, but so far have only received a response from Facebook, which calls the incident a &#34;violation of our terms.&#34;<br />
]]></description>
			<content:encoded><![CDATA[<p>Pete Warden, a former software engineer at Apple, who is now working on his own start-up, <a href="http://petewarden.typepad.com/searchbrowser/2010/04/how-i-got-sued-by-facebook.html">posted an interesting story</a> about how Facebook threatened to sue him for crawling the social network. I reached out to both Warden and Facebook for more details, but so far have only received a response from Facebook, which calls the incident a &quot;violation of our terms.&quot;</p>
<p>But first, Warden&#8217;s story. Read the whole thing in his words <a href="http://petewarden.typepad.com/searchbrowser/2010/04/how-i-got-sued-by-facebook.html">here</a> for more context about what he wanted to do with the data, but to make a long story short, he was building a tool to bring data from email and various social networks into one place to make it easier for users to manage their contacts, and he crawled Facebook. He says he checked Facebook&#8217;s robots.txt, and that &quot;they welcome the web crawlers that search engines use to gather their data,&quot; so he wrote his own. He was able to obtain data like which pages people were fans of and links to a few of their friends. He created a map showing how different countries, states and cities were connected to each other and released it so that others could use the information. Once Facebook caught wind of this, they threatened legal action. Warden writes:</p>
<p><em>Their contention was robots.txt had no legal force and they could sue anyone for accessing their site even if they scrupulously obeyed the instructions it contained. The only legal way to access any web site with a crawler was to obtain prior written permission.</p>
<p>Obviously this isn&#8217;t the way the web has worked for the last 16 years since robots.txt was introduced, but my lawyer advised me that it had never been tested in court, and the legal costs alone of being a test case would bankrupt me. With that in mind, I spent the next few weeks negotiating a final agreement with their attorney. They were quite accommodating on the details, such as allowing my blog post to remain up, and initially I was hopeful that they were interested in a supervised release of the data set with privacy safeguards. Unfortunately it became clear towards the end that they wanted the whole set destroyed. </em></p>
<p><a href="http://www.andrewnoyes.net/bio.html"><img align="right" src="http://images.ientrymail.com/webpronews/article_pics/andrew-noyes.jpg" alt="Andrew Noyes, Facebook Public Policy Communications Manager talks Pete Warden crawling Facebook data" title="Andrew Noyes, Facebook Public Policy Communications Manager talks Pete Warden crawling Facebook data" style="margin: 10px;" /></a>Facebook Public Policy Communications Manager Andrew Noyes tells WebProNews, &quot;Pete Warden aggregated a large amount of data from over 200 million users without our permission, in violation of our terms. He also publicly stated he intended to make that raw data freely available to others. Warden was extremely cooperative with Facebook from the moment we contacted him and he abandoned his plans.&quot;</p>
<p>&quot;We have, and will continue to, act to enforce our terms of service where appropriate,&quot; adds Noyes.</p>
<p>Noyes pointed to <a href="http://www.facebook.com/terms.php">Facebook&#8217;s Statement of Rights and Responsibilities</a>, which states that &quot;You will not collect users&#8217; content or information, or otherwise access Facebook, using automated means (such as harvesting bots, robots, spiders, or scrapers) without our permission.&quot; That&#8217;s under the safety section, by the way.</p>
<p>&quot;I&#8217;m bummed that Facebook are taking a legal position that would cripple the web if it was adopted (how many people would Google need to hire to write letters to every single website they crawled?),&quot; concludes Warden. &quot;And a bit frustrated that people don&#8217;t understand that the data I was planning to release is already in the hands of lots of commercial marketing firms, but mostly I&#8217;m just looking forward to leaving the massive distraction of a legal threat behind and getting on with building my startup.&quot;</p>
<p>Hearing some of what both parties have to say on the issue, what are your thoughts? <a href="http://www.webpronews.com/node/53915/talk"><u><strong>Discuss here</strong></u></a>. </p>
<p>If we hear back from Warden or if Facebook offers us more insight into the situation, which I&#8217;m told may still happen, I&#8217;ll update this article.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.webpronews.com/developer-shares-story-of-being-threatened-by-facebook-for-crawling-2010-04/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Google States Case for Online News in WSJ</title>
		<link>http://www.webpronews.com/google-does-more-to-appease-disgruntled-news-publishers-2009-12</link>
		<comments>http://www.webpronews.com/google-does-more-to-appease-disgruntled-news-publishers-2009-12#comments</comments>
		<pubDate>Thu, 03 Dec 2009 18:33:00 +0000</pubDate>
		<dc:creator>Chris Crum</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[Crawlers]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[google news]]></category>
		<category><![CDATA[News]]></category>
		<category><![CDATA[news search]]></category>
		<category><![CDATA[Online News]]></category>
		<category><![CDATA[Robots.txt]]></category>
		<category><![CDATA[Web Crawlers]]></category>

		<guid isPermaLink="false">http://www.webpronews.com/?p=52281</guid>
		<description><![CDATA[<p><strong>Update:&#160;</strong>The Wall Street Journal is <a href="http://online.wsj.com/article/SB10001424052748704107104574569570797550520.html">running a piece</a> from Google&#160;CEO&#160;Eric Schmidt on how Google can help newspapers. It's an interesting read. <br />
]]></description>
			<content:encoded><![CDATA[<p><strong>Update:&nbsp;</strong>The Wall Street Journal is <a href="http://online.wsj.com/article/SB10001424052748704107104574569570797550520.html">running a piece</a> from Google&nbsp;CEO&nbsp;Eric Schmidt on how Google can help newspapers. It&#8217;s an interesting read. </p>
<p><strong>Original Article:&nbsp;</strong>Google has created a new web crawler specifically for Google News. What this means is that publishers who do not want Google News to index their content can more easily control that. That also applies to publishers who don&#8217;t wish to completely cut out indexing, but wish to limit/restrict certain elements of their content from being indexed. </p>
<p>Google offers this new crawler at a time when Google&#8217;s relationship with online news is a heavy focus of discussion throughout the industry, with the <a href="http://www.webpronews.com/topnews/2009/12/01/minds-of-the-media-gather-to-discuss-future-of-news">FTC&#8217;s meeting of the media minds</a> taking place. This week <a href="http://www.webpronews.com/topnews/2009/12/01/google-changes-how-it-handles-paid-content">Google already announced some changes</a> to how it handles paid content (by offering a five-article limit for the &quot;first click free&quot; plan). Now the company appears to be further extending its olive branch to concerned publishers (whether or not that will be enough is another discussion). </p>
<p>In the past, publishers have been able to block Google from content via robots.txt and the Robots Exclusion Protocol (REP). They have also been able to keep content out of Google News and stay in Google Search, by using a contact form provided by Google. Now, Google is making it so publishers don&#8217;t even have to contact them. </p>
<p><img align="right" style="margin: 10px;" title="Josh Cohen" alt="Josh Cohen" src="http://images.ientrymail.com/webpronews/article_pics/josh-cohen.jpg" />&quot;Now, with the news-specific crawler, if a publisher wants to opt out of Google News, they don&#8217;t even have to contact us &#8211; they can put instructions just for user-agent Googlebot-News in the same robots.txt file they have today,&quot; <a href="http://googlenewsblog.blogspot.com/2009/12/same-protocol-more-options-for-news.html">says</a> Google News Senior Business Product Manager Josh Cohen. &quot;In addition, once this change is fully in place, it will allow publishers to do more than just allow/disallow access to Google News. They&#8217;ll also be able to apply the full range of REP directives just to Google News. Want to block images from Google News, but not from Web Search? Go ahead. Want to include snippets in Google News, but not in Web Search? Feel free. All this will soon be possible with the same standard protocol that is REP.&quot;</p>
<p>&quot;While this means even more control for publishers, the effect of opting out of News is the same as it&#8217;s always been,&quot; says Cohen. &quot;It means that content won&#8217;t be in Google News or in the parts of Google that are powered by the News index. For example, if a publisher opts out of Google News, but stays in Web Search, their content will still show up as natural web search results, but they won&#8217;t appear in the block of news results that sometimes shows up in Web Search, called Universal search, since those come from the Google News index.&quot;</p>
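<p>As a rough illustration of the opt-out Cohen describes, the sketch below checks a robots.txt file that blocks only the Googlebot-News user-agent. It relies on the parser in Python&#8217;s standard library, whose user-agent matching is a simplification and may differ from Google&#8217;s own; the file contents and the example.com URL are assumptions made up for this example.</p>

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of Google News, stay open to everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot-News
Disallow: /

User-agent: *
Disallow:
"""


def can_crawl(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Placeholder article URL, used only for illustration.
ARTICLE = "http://www.example.com/news/story.html"

for agent in ("Googlebot-News", "Googlebot"):
    print(agent, can_crawl(agent, ARTICLE))
```

<p>Under these rules the news-specific crawler is turned away while the regular web search crawler is still allowed, which matches the &quot;out of Google News, still in Web Search&quot; scenario.</p>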
<p>Cohen says Google News users shouldn&#8217;t notice any difference in their experience with the service. It will be interesting to see the reaction from disgruntled publishers, and whether or not this will make any significant difference in how they view Google News. </p>
<p>
<strong>Related Articles:</strong></p>
<p><span style="font-family: Arial;"><span style="font-size: larger;">&gt;&nbsp;</span></span><a style="color: rgb(0, 105, 210); text-decoration: underline;" href="http://www.webpronews.com/topnews/2009/12/01/google-changes-how-it-handles-paid-content"><span style="font-family: Arial;"><span style="font-size: larger;">Google Changes How it Handles Paid Content</span></span></a></p>
<p><span style="font-family: Arial;"><span style="font-size: larger;">&gt; </span></span><a style="color: rgb(0, 105, 210); text-decoration: underline;" href="http://www.webpronews.com/topnews/2009/12/01/minds-of-the-media-gather-to-discuss-future-of-news"><span style="font-family: Arial;"><span style="font-size: larger;">Minds of the Media Gather to Discuss Future of News</span></span></a></p>
<p><span style="font-family: Arial;"><span style="font-size: larger;">&gt; </span></span><a style="color: rgb(0, 105, 210); text-decoration: underline;" href="../../../../../../topnews/2009/11/09/google-okay-with-blocking-news-corp"><span style="font-family: Arial;"><span style="font-size: larger;">Google Okay With Blocking News Corp.</span></span></a></p>
<p><span style="font-family: Arial;"><span style="font-size: larger;">&gt; </span></span><a style="color: rgb(0, 105, 210); text-decoration: underline;" href="../../../../../../topnews/2009/11/24/is-the-murdock-bing-deal-really-just-about-the-wall-street-journal"><span style="font-family: Arial;"><span style="font-size: larger;">Is it Really Crazy to Block Google?</span></span></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.webpronews.com/google-does-more-to-appease-disgruntled-news-publishers-2009-12/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Why Your Robots.txt Blocked URLs May Show up in Google</title>
		<link>http://www.webpronews.com/why-your-robotstxt-blocked-urls-may-show-up-in-google-2009-10</link>
		<comments>http://www.webpronews.com/why-your-robotstxt-blocked-urls-may-show-up-in-google-2009-10#comments</comments>
		<pubDate>Tue, 06 Oct 2009 22:22:18 +0000</pubDate>
		<dc:creator>Chris Crum</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Matt Cutts]]></category>
		<category><![CDATA[Robots.txt]]></category>
		<category><![CDATA[SEO]]></category>
		<category><![CDATA[uncrawled urls]]></category>
		<category><![CDATA[Videos]]></category>
		<category><![CDATA[Webmasters]]></category>

		<guid isPermaLink="false">http://www.webpronews.com/?p=51675</guid>
		<description><![CDATA[<p>Matt Cutts has appeared in yet another Google Webmaster Video, and this time he has a whiteboard with him so he can illustrate what he's talking about. What he's talking about this time are uncrawled URLs in search results.</p>
<p>Cutts says Google gets a lot of complaints from webmasters who say the search engine is <strong>violating their robots.txt files</strong>, with which they intend to keep Google from crawling certain pages. Sometimes those URLs still end up in search results.</p>
]]></description>
			<content:encoded><![CDATA[<p>Matt Cutts has appeared in yet another Google Webmaster Video, and this time he has a whiteboard with him so he can illustrate what he&#8217;s talking about. What he&#8217;s talking about this time are uncrawled URLs in search results. </p>
<p>Cutts says Google gets a lot of complaints from webmasters who say the search engine is <strong>violating their robots.txt files</strong>, with which they intend to keep Google from crawling certain pages. Sometimes those URLs still end up in search results. </p>
<p>According to Matt, in most cases where someone says they blocked example.com/go in robots.txt, the result Google returns is just a URL with no text snippet. The reason for this is that <strong>Google didn&#8217;t actually crawl the page</strong>. </p>
<p>&quot;It did abide by robots.txt. You told us this page is blocked, so we did not fetch this page,&quot; says Matt. It is a URL reference. &quot;We saw a link to it, but we didn&#8217;t fetch the page itself,&quot; he explains.</p>
<p>Because Google never fetched the page itself, there is no text snippet to show. In case you were wondering what the point of showing such results at all is, Cutts offers the example of the California DMV, whose site is www.dmv.ca.gov.</p>
<center>
<table>
<tbody>
<tr>
<td><object height="340" width="560"><param name="movie" value="http://www.youtube.com/v/KBdEwpRQRD0&amp;hl=en&amp;fs=1&amp;" /><param name="allowFullScreen" value="true" /><param name="allowscriptaccess" value="always" /><embed height="340" width="560" src="http://www.youtube.com/v/KBdEwpRQRD0&amp;hl=en&amp;fs=1&amp;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true"></embed></object></td>
</tr>
</tbody>
</table>
</center>
<p>Cutts notes that at one point the California Department of Motor Vehicles had a robots.txt that blocked all search engines. &quot;Now these days pretty much every site is savvy enough, you know, at one point the New York Times and eBay and a whole bunch of different sites would use robots.txt,&quot; he says.</p>
<p>If someone searches for &quot;California DMV&quot; in Google, there&#8217;s pretty much only one answer, he says, so that is the answer Google wants to return. Luckily for Google, <strong>a lot of people were linking to that page with the anchor text</strong> &quot;California DMV&quot;. That lets Google return the result without having to crawl the page. </p>
<p>Cutts also says that <strong>they can get descriptions from a directory</strong> like the <a href="http://www.dmoz.org/">Open Directory Project</a> (DMOZ). He cites Nissan and Metallica.com as examples of sites that used to block Google with robots.txt. They had been listed in the Open Directory Project, however, and Google went and got the information from there to include as the snippet. </p>
<p>When this type of thing happens, it looks like the page was crawled, when in fact it wasn&#8217;t. &quot;So we are able to return something that can be very helpful to users without violating robots.txt by not crawling that page,&quot; says Cutts.</p>
<p>He also notes that when you don&#8217;t want pages to show up, you can use the &quot;noindex&quot; meta tag at the top of the page. When Google sees this tag, it drops the page from its search results completely. Another option is the <a href="https://www.google.com/webmasters/tools/removals?pli=1">URL removal tool</a>.</p>
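<p>For reference, a minimal sketch of the &quot;noindex&quot; meta tag mentioned above, using the standard robots meta tag syntax. Note that Google can only see this tag if it is allowed to fetch the page, so the page should not also be blocked in robots.txt:</p>
<pre><code>&lt;head&gt;
  &lt;!-- Asks compliant search engines to keep this page out of their results --&gt;
  &lt;meta name="robots" content="noindex"&gt;
&lt;/head&gt;</code></pre>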
]]></content:encoded>
			<wfw:commentRss>http://www.webpronews.com/why-your-robotstxt-blocked-urls-may-show-up-in-google-2009-10/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>White House Unblocks Google</title>
		<link>http://www.webpronews.com/white-house-unblocks-google-2009-01</link>
		<comments>http://www.webpronews.com/white-house-unblocks-google-2009-01#comments</comments>
		<pubDate>Sat, 24 Jan 2009 23:58:47 +0000</pubDate>
		<dc:creator>WebProNews Staff</dc:creator>
				<category><![CDATA[Life]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Politics]]></category>
		<category><![CDATA[Robots.txt]]></category>
		<category><![CDATA[whitehouse.gov]]></category>

		<guid isPermaLink="false">http://www.webpronews.com/?p=48423</guid>
		<description><![CDATA[<p>If the outgoing Bush Administration was thought to run a secretive, bubble-icious type of White House, the Obama Administration so far is proving to be the opposite. The Whitehouse.gov redesign for greater transparency has already been widely noted&#8212;Presidential blog and all&#8212;but the website is now much more open to a new kind of visitor: the search engine spider. <br />]]></description>
			<content:encoded><![CDATA[<p>If the outgoing Bush Administration was thought to run a secretive, bubble-icious type of White House, the Obama Administration so far is proving to be the opposite. The Whitehouse.gov redesign for greater transparency has already been widely noted&mdash;Presidential blog and all&mdash;but the website is now much more open to a new kind of visitor: the search engine spider. </p>
<p>On Monday, Whitehouse.gov was still <a href="http://www.kottke.org/09/01/the-countrys-new-robotstxt-file">blocking search engine access</a> to a tremendous amount of website information. In all, the robots.txt file used the &ldquo;Disallow&rdquo; directive 2,400 times, blocking search engine access to information on earmarks, African American history, photo essays from various places and events, first lady initiatives, the budget, defense, and so on.</p>
<center><img border="0" style="margin: 4px;" src="http://images.ientrymail.com/webpronews/article_pics/disallow-whitehouse.jpg" alt="White House Unblocks Google" title="White House Unblocks Google" /></center>
<p>Obviously, if posted on the White House website, none of this information would be considered classified, or even sensitive, so it&rsquo;s unclear why Bush&rsquo;s web crew felt the need to prevent the site from being searchable. </p>
<p>Regardless, all search crawler barriers were removed along with the Bushes&rsquo; furniture, with the &ldquo;Disallow&rdquo; lines reduced from 2,400 to basically none. </p>
<p>Requests for comment and/or explanations from prior and current administrations were not returned. Meanwhile, it appears President Obama will be able to <a href="http://marcambinder.theatlantic.com/archives/2009/01/obama_will_get_his_blackberry.php">keep his BlackBerry</a> after all&mdash;with some super-encryption functionality added to it.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.webpronews.com/white-house-unblocks-google-2009-01/feed</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
		<item>
		<title>Search Engines Indexing Google Profiles</title>
		<link>http://www.webpronews.com/search-engines-indexing-google-profiles-2008-10</link>
		<comments>http://www.webpronews.com/search-engines-indexing-google-profiles-2008-10#comments</comments>
		<pubDate>Sun, 19 Oct 2008 14:26:34 +0000</pubDate>
		<dc:creator>Navneet Kaushal</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[google profiles]]></category>
		<category><![CDATA[Robots.txt]]></category>

		<guid isPermaLink="false">http://www.webpronews.com/?p=47375</guid>
		<description><![CDATA[<p>Google Profiles serve as the main building blocks of Google's effort to become the main social map of the web.</p> <p>&#34;A Google Profile is simply how you represent yourself on Google products &#8212; it lets you tell others a bit more about who you are and what you're all about. You control what goes into your Google Profile, sharing as much (or as little) as you'd like.&#34; - Google</p>]]></description>
			<content:encoded><![CDATA[<p>Google Profiles serve as the main building blocks of Google&#8217;s effort to become the main social map of the web.</p>
<p>&quot;A Google Profile is simply how you represent yourself on Google products &mdash; it lets you tell others a bit more about who you are and what you&#8217;re all about. You control what goes into your Google Profile, sharing as much (or as little) as you&#8217;d like.&quot; &#8211; Google</p>
<p>The news is that <a href="http://googlesystem.blogspot.com/2007/12/google-profiles.html" onclick="javascript:urchinTracker('/outbound/googlesystem.blogspot.com/2007/12/google-profiles.html?ref=http_//www.google.com/reader/view/');"><u>Google Profiles</u></a> are now being indexed by search engines!</p>
<p>This is because Google has added a new line to its robots.txt file and lifted the noindex-style directive for these pages.</p>
<p><strong><a href="http://blogs.zdnet.com/Google/?p=1158" onclick="javascript:urchinTracker('/outbound/blogs.zdnet.com/Google/?p=1158?ref=http_//www.google.com/reader/view/');"><u>Garett Rogers</u></a> writes:</strong></p>
<p>Just about a half hour ago, Google added a new line into their robots.txt file which makes all those profiles (or at least 50,000 of them) crawlable by search engines. The new entry tells search engines to use &ldquo;<a href="http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml" onclick="javascript:urchinTracker('/outbound/www.gstatic.com/s2/sitemaps/profiles-sitemap.xml?ref=http_//www.google.com/reader/view/');"><u>http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml</u></a>&rdquo; as a sitemap. The sitemap looks something like this:</p>
<p><a href="http://www.gstatic.com/s2/sitemaps/sitemap-000.txt" onclick="javascript:urchinTracker('/outbound/www.gstatic.com/s2/sitemaps/sitemap-000.txt?ref=http_//www.google.com/reader/view/');"><u>http://www.gstatic.com/s2/sitemaps/sitemap-000.txt</u></a><br /> 2008-10-15</p>
<p><a href="http://www.gstatic.com/s2/sitemaps/sitemap-001.txt" onclick="javascript:urchinTracker('/outbound/www.gstatic.com/s2/sitemaps/sitemap-001.txt?ref=http_//www.google.com/reader/view/');"><u>http://www.gstatic.com/s2/sitemaps/sitemap-001.txt</u></a><br /> 2008-10-15</p>
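<p>As an illustration only, entries like these would typically be expressed in a sitemap index file along the following lines. This is a sketch based on the sitemaps.org format, not a copy of the actual file Google published:</p>
<pre><code>&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"&gt;
  &lt;!-- One entry per child sitemap, with its last-modified date --&gt;
  &lt;sitemap&gt;
    &lt;loc&gt;http://www.gstatic.com/s2/sitemaps/sitemap-000.txt&lt;/loc&gt;
    &lt;lastmod&gt;2008-10-15&lt;/lastmod&gt;
  &lt;/sitemap&gt;
  &lt;sitemap&gt;
    &lt;loc&gt;http://www.gstatic.com/s2/sitemaps/sitemap-001.txt&lt;/loc&gt;
    &lt;lastmod&gt;2008-10-15&lt;/lastmod&gt;
  &lt;/sitemap&gt;
&lt;/sitemapindex&gt;</code></pre>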
<p>According to Garett, Google is likely to launch a &ldquo;People Onebox&rdquo; at some point. Google has already done the same for local results and books.</p>
<p><a href="http://www.pagetrafficblog.com/google-profiles-indexable-search-engines/5444/">Comments</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.webpronews.com/search-engines-indexing-google-profiles-2008-10/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Google Indexing Sites in 1 Day Again</title>
		<link>http://www.webpronews.com/google-indexing-sites-in-1-day-again-2008-01</link>
		<comments>http://www.webpronews.com/google-indexing-sites-in-1-day-again-2008-01#comments</comments>
		<pubDate>Mon, 07 Jan 2008 20:31:35 +0000</pubDate>
		<dc:creator>Michael Jensen</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[AdWords]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Indexing]]></category>
		<category><![CDATA[Robots]]></category>
		<category><![CDATA[Robots.txt]]></category>
		<category><![CDATA[spiders]]></category>

		<guid isPermaLink="false">http://www.webpronews.com/?p=43120</guid>
		<description><![CDATA[<p>I created a new site on Friday, and by Saturday, exactly 24 hours later, it was in Google&#8217;s index. I wrote about this just over a month ago in my post, <a href="http://www.soloseo.com/blog/2007/11/26/7-steps-to-get-your-new-site-indexed-in-24-hours/" title="7 Steps to Get Your New Site Indexed in 24 Hours">7 Steps to Get Your New Site Indexed in 24 Hours</a>. <br /><br />I had a lot of comments about whether or not AdWords was necessary, so I thought I&#8217;d try it again without running AdWords this time. Here&#8217;s how it all played out:</p>]]></description>
			<content:encoded><![CDATA[<p>I created a new site on Friday, and by Saturday, exactly 24 hours later, it was in Google&rsquo;s index. I wrote about this just over a month ago in my post, <a href="http://www.soloseo.com/blog/2007/11/26/7-steps-to-get-your-new-site-indexed-in-24-hours/" title="7 Steps to Get Your New Site Indexed in 24 Hours">7 Steps to Get Your New Site Indexed in 24 Hours</a>. </p>
<p>I had a lot of comments about whether or not AdWords was necessary, so I thought I&rsquo;d try it again without running AdWords this time. Here&rsquo;s how it all played out:</p>
<p>1) I created 5 pages of content (Home, FAQ, About Us, etc.).</p>
<p>2) I put them in a simple template with site-wide links. I also linked to it from one of my other sites (it&rsquo;s very relevant so it makes sense).</p>
<p>3) I tagged the site on only 2 social bookmarking sites.</p>
<p>4) Commented in 1 forum, put the URL in one directory (niche specific), and submitted it to Digg.</p>
<p>5) Installed Google Analytics.</p>
<p>6) Created a sitemap, <a href="http://www.soloseo.com/blog/2007/04/12/how-to-configure-sitemap-autodiscovery-in-robots-txt/" title="pinged Google, and put the sitemap in my robots.txt">pinged Google, and put the sitemap in my robots.txt</a>. Logged in to Google Webmaster Central and submitted my sitemap there.</p>
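<p>For anyone unfamiliar with sitemap autodiscovery, it comes down to one line in robots.txt. A minimal sketch, with example.com as a placeholder rather than the actual site:</p>
<pre><code># robots.txt at the site root
User-agent: *
Disallow:

# Absolute URL of the sitemap (placeholder domain)
Sitemap: http://www.example.com/sitemap.xml</code></pre>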
<p>When I checked exactly 24 hours later I was sitting in the index and had already begun to get a few visitors from Google.</p>
<p>I had previously run Google AdWords both out of necessity (to get quick traffic) and because of the trust factor I believe it gives to Google, along with the fact that Google integrates a quality factor into its quality score (so they come to your site and look at it). Obviously this is just one test compared to the several others I&rsquo;ve done with AdWords, but it seems it&rsquo;s very possible without running any ads.</p>
<p>Anyone else seeing 24-hour indexing for new sites?</p>
<p><a href="http://www.soloseo.com/blog/2008/01/07/24-hour-indexing-new-sites/#respond">Comments</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.webpronews.com/google-indexing-sites-in-1-day-again-2008-01/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
