<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>WebProNews &#187; Web Crawlers</title>
	<atom:link href="http://www.webpronews.com/tag/web-crawlers/feed" rel="self" type="application/rss+xml" />
	<link>http://www.webpronews.com</link>
	<description>Breaking News in Tech, Search, Social, &#38; Business</description>
	<lastBuildDate>Mon, 13 Feb 2012 15:43:44 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>New &#8220;Bingbot&#8221; Will Crawl Non-optimized Sites More Easily</title>
		<link>http://www.webpronews.com/new-bingbot-will-crawl-non-optimized-sites-more-easily-2010-06</link>
		<comments>http://www.webpronews.com/new-bingbot-will-crawl-non-optimized-sites-more-easily-2010-06#comments</comments>
		<pubDate>Mon, 28 Jun 2010 19:05:05 +0000</pubDate>
		<dc:creator>Chris Crum</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[Bing]]></category>
		<category><![CDATA[Bingbot]]></category>
		<category><![CDATA[Microsoft]]></category>
		<category><![CDATA[search engines]]></category>
		<category><![CDATA[SEO]]></category>
		<category><![CDATA[Web Crawlers]]></category>

		<guid isPermaLink="false">http://www.webpronews.com/?p=54471</guid>
		<description><![CDATA[<p>Microsoft has announced that it will be bringing the Bing web crawler out of beta on October 1st. It will be rebranded as &#34;the Bingbot&#34; and replace the existing msnbot. &#34;It will still honor robots.txt directives written for msnbot, so no change is required to robots.txt file(s),&#34; a Bing representative tells WebProNews.<br />
<br />
&#34;Improvements to the bot enable more efficient crawling, and increase the ability to crawl content on sites not optimized for search,&#34; he says.<br />
&#160;]]></description>
			<content:encoded><![CDATA[<p>Microsoft has announced that it will be bringing the Bing web crawler out of beta on October 1st. It will be rebranded as &quot;the Bingbot&quot; and replace the existing msnbot. &quot;It will still honor robots.txt directives written for msnbot, so no change is required to robots.txt file(s),&quot; a Bing representative tells WebProNews.</p>
<p>&quot;Improvements to the bot enable more efficient crawling, and increase the ability to crawl content on sites not optimized for search,&quot; he says.<br />
&nbsp;<br />
<img align="right" src="http://images.ientrymail.com/webpronews/article_pics/robot.jpg" alt="Robot - This is not the real Bingbot, but it will be here in October." title="Robot - This is not the real Bingbot, but it will be here in October." style="margin: 10px;" />Rick DeJarnette has <a href="http://www.bing.com/community/blogs/webmaster/archive/2010/06/28/bing-crawler-bingbot-on-the-horizon.aspx">more about the change</a> on the Bing Webmaster Blog:</p>
<p><em>Instead of the old msnbot 2.0b showing up in your server logs, the updated user agent will be:  </em></p>
<p><em><span style="font-family: Courier New;">Mozilla/5.0 (compatible; bingbot/2.0 +http://www.bing.com/bingbot.htm)</span></em></p>
<p><em>The HTTP header From field will also change as shown below:</em></p>
<p><em><span style="font-family: Courier New;">From: msnbot(at)microsoft.com</span></em></p>
<p><em>will become</em></p>
<p><em><span style="font-family: Courier New;">From: bingbot(at)microsoft.com</span></em></p>
<p>If Bing finds separate sets of directives for Bingbot and for other crawlers, directives for bingbot will take precedence, the company says. </p>
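<p>The precedence rule described above can be sketched with Python's standard-library robots.txt parser. This is only an illustration: the robots.txt content and the example.com host are hypothetical, and the standard-library parser is a stand-in that approximates group matching, not Bing's own implementation.</p>

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a generic group and a bingbot-specific group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: bingbot
Disallow: /drafts/
"""


def build_parser(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
base = "http://www.example.com"  # placeholder host, not from the article

# The parser matches on the crawler token ("bingbot"), not the full
# "Mozilla/5.0 (compatible; ...)" user-agent string.
bingbot_private = parser.can_fetch("bingbot", base + "/private/page.html")
bingbot_drafts = parser.can_fetch("bingbot", base + "/drafts/page.html")
other_private = parser.can_fetch("othercrawler", base + "/private/page.html")

print(bingbot_private, bingbot_drafts, other_private)
```

<p>In this sketch the bingbot group replaces the generic group for that crawler, so bingbot may fetch /private/ but not /drafts/, while any other crawler falls back to the generic rules.</p>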
<p>I find the part about increasing the ability to crawl content on sites not optimized for search to be particularly interesting. I wouldn&#8217;t exactly call this an invitation to ignore SEO. Obviously Google is still the biggest search engine anyway, but even as far as Bing is concerned, good SEO practices will likely still help your rankings. </p>
<p>Also keep in mind that optimizing for Bing is becoming increasingly important. Not only is <a href="http://www.webpronews.com/topnews/2010/06/25/likes-mean-relevance-in-facebook-search">Facebook giving more reason for people to search</a> (where Bing provides the web results), but the <a href="http://www.webpronews.com/topnews/2010/06/16/time-to-start-placing-more-emphasis-on-bing-seo">Yahoo/Bing integration</a> will be here (likely) before the holidays.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.webpronews.com/new-bingbot-will-crawl-non-optimized-sites-more-easily-2010-06/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Google States Case for Online News in WSJ</title>
		<link>http://www.webpronews.com/google-does-more-to-appease-disgruntled-news-publishers-2009-12</link>
		<comments>http://www.webpronews.com/google-does-more-to-appease-disgruntled-news-publishers-2009-12#comments</comments>
		<pubDate>Thu, 03 Dec 2009 18:33:00 +0000</pubDate>
		<dc:creator>Chris Crum</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[Crawlers]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[google news]]></category>
		<category><![CDATA[News]]></category>
		<category><![CDATA[news search]]></category>
		<category><![CDATA[Online News]]></category>
		<category><![CDATA[Robots.txt]]></category>
		<category><![CDATA[Web Crawlers]]></category>

		<guid isPermaLink="false">http://www.webpronews.com/?p=52281</guid>
		<description><![CDATA[<p><strong>Update:&#160;</strong>The Wall Street Journal is <a href="http://online.wsj.com/article/SB10001424052748704107104574569570797550520.html">running a piece</a> from Google&#160;CEO&#160;Eric Schmidt on how Google can help newspapers. It's an interesting read. <br />
]]></description>
			<content:encoded><![CDATA[<p><strong>Update:&nbsp;</strong>The Wall Street Journal is <a href="http://online.wsj.com/article/SB10001424052748704107104574569570797550520.html">running a piece</a> from Google&nbsp;CEO&nbsp;Eric Schmidt on how Google can help newspapers. It&#8217;s an interesting read. </p>
<p><strong>Original Article:&nbsp;</strong>Google has created a new web crawler specifically for Google News. What this means is that publishers who do not want Google News to index their content can more easily control that. That also applies to publishers who don&#8217;t wish to completely cut out indexing, but wish to limit/restrict certain elements of their content from being indexed. </p>
<p>Google offers this new crawler at a time when Google&#8217;s relationship with online news is a heavy focus of discussion throughout the industry, with the <a href="http://www.webpronews.com/topnews/2009/12/01/minds-of-the-media-gather-to-discuss-future-of-news">FTC&#8217;s meeting of the media minds</a> taking place. This week <a href="http://www.webpronews.com/topnews/2009/12/01/google-changes-how-it-handles-paid-content">Google already announced some changes</a> to how it handles paid content (by offering a five-article limit for the &quot;first click free&quot; plan). Now the company appears to be further extending its olive branch to concerned publishers (whether or not that will be enough is another discussion). </p>
<p>In the past, publishers have been able to block Google from content via robots.txt and the Robots Exclusion Protocol (REP). They have also been able to keep content out of Google News while staying in Google Search by using a contact form provided by Google. Now, Google is making it so publishers don&#8217;t even have to contact the company. </p>
<p><img align="right" style="margin: 10px;" title="Josh Cohen" alt="Josh Cohen" src="http://images.ientrymail.com/webpronews/article_pics/josh-cohen.jpg" />&quot;Now, with the news-specific crawler, if a publisher wants to opt out of Google News, they don&#8217;t even have to contact us &#8211; they can put instructions just for user-agent Googlebot-News in the same robots.txt file they have today,&quot; <a href="http://googlenewsblog.blogspot.com/2009/12/same-protocol-more-options-for-news.html">says</a> Google News Senior Business Product Manager Josh Cohen. &quot;In addition, once this change is fully in place, it will allow publishers to do more than just allow/disallow access to Google News. They&#8217;ll also be able to apply the full range of REP directives just to Google News. Want to block images from Google News, but not from Web Search? Go ahead. Want to include snippets in Google News, but not in Web Search? Feel free. All this will soon be possible with the same standard protocol that is REP.&quot;</p>
<p>&quot;While this means even more control for publishers, the effect of opting out of News is the same as it&#8217;s always been,&quot; says Cohen. &quot;It means that content won&#8217;t be in Google News or in the parts of Google that are powered by the News index. For example, if a publisher opts out of Google News, but stays in Web Search, their content will still show up as natural web search results, but they won&#8217;t appear in the block of news results that sometimes shows up in Web Search, called Universal search, since those come from the Google News index.&quot;</p>
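<p>As a rough sketch of the opt-out Cohen describes, the snippet below builds a robots.txt group aimed only at the Googlebot-News user-agent and checks it with Python's standard-library parser. The helper name disallow_group and the example.com URL are hypothetical, and the parser only approximates how Google itself would read the file.</p>

```python
from urllib.robotparser import RobotFileParser


def disallow_group(user_agent, paths):
    """Build one robots.txt group that disallows the given paths for one agent."""
    lines = ["User-agent: " + user_agent]
    lines += ["Disallow: " + path for path in paths]
    return "\n".join(lines) + "\n"


# Opt the whole site out of Google News only; no rules are written for Googlebot.
robots_txt = disallow_group("Googlebot-News", ["/"])

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

url = "http://www.example.com/2009/12/story.html"  # placeholder URL
news_allowed = parser.can_fetch("Googlebot-News", url)
search_allowed = parser.can_fetch("Googlebot", url)

print(robots_txt)
print(news_allowed, search_allowed)
```

<p>Under these assumptions the news crawler is refused while the regular search crawler is still allowed, which matches the "out of News, still in Web Search" case in the quote.</p>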
<p>Cohen says Google News users shouldn&#8217;t notice any difference in their experience with the service. It will be interesting to see the reaction from disgruntled publishers, and whether or not this will make any significant difference in how they view Google News. </p>
<p>
<strong>Related Articles:</strong></p>
<p><span style="font-family: Arial;"><span style="font-size: larger;">&gt;&nbsp;</span></span><a style="color: rgb(0, 105, 210); text-decoration: underline;" href="http://www.webpronews.com/topnews/2009/12/01/google-changes-how-it-handles-paid-content"><span style="font-family: Arial;"><span style="font-size: larger;">Google Changes How it Handles Paid Content</span></span></a></p>
<p><span style="font-family: Arial;"><span style="font-size: larger;">&gt; </span></span><a style="color: rgb(0, 105, 210); text-decoration: underline;" href="http://www.webpronews.com/topnews/2009/12/01/minds-of-the-media-gather-to-discuss-future-of-news"><span style="font-family: Arial;"><span style="font-size: larger;">Minds of the Media Gather to Discuss Future of News</span></span></a></p>
<p><span style="font-family: Arial;"><span style="font-size: larger;">&gt; </span></span><a style="color: rgb(0, 105, 210); text-decoration: underline;" href="../../../../../../topnews/2009/11/09/google-okay-with-blocking-news-corp"><span style="font-family: Arial;"><span style="font-size: larger;">Google Okay With Blocking News Corp.</span></span></a></p>
<p><span style="font-family: Arial;"><span style="font-size: larger;">&gt; </span></span><a style="color: rgb(0, 105, 210); text-decoration: underline;" href="../../../../../../topnews/2009/11/24/is-the-murdock-bing-deal-really-just-about-the-wall-street-journal"><span style="font-family: Arial;"><span style="font-size: larger;">Is it Really Crazy to Block Google?</span></span></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.webpronews.com/google-does-more-to-appease-disgruntled-news-publishers-2009-12/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Yahoo Slurps Somewhere Else</title>
		<link>http://www.webpronews.com/yahoo-slurps-somewhere-else-2007-06</link>
		<comments>http://www.webpronews.com/yahoo-slurps-somewhere-else-2007-06#comments</comments>
		<pubDate>Thu, 07 Jun 2007 17:44:23 +0000</pubDate>
		<dc:creator>WebProNews Staff</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[blog]]></category>
		<category><![CDATA[SEO]]></category>
		<category><![CDATA[server logs]]></category>
		<category><![CDATA[Web Crawlers]]></category>
		<category><![CDATA[Webmasters]]></category>
		<category><![CDATA[WebMasterWorld]]></category>
		<category><![CDATA[Yahoo]]></category>
		<category><![CDATA[Yahoo Slurp]]></category>

		<guid isPermaLink="false">http://www.webpronews.com/?p=38284</guid>
		<description><![CDATA[<p>The migration of the Slurp is complete, says Yahoo. Over the past few weeks, the search engine has been transitioning its crawler, dubbed (disgustingly) &#34;Slurp,&#34; to a new address at crawl.yahoo.net. Adjust your server logs as necessary and join the curmudgeons who are unimpressed. <br />
]]></description>
			<content:encoded><![CDATA[<p>The migration of the Slurp is complete, says Yahoo. Over the past few weeks, the search engine has been transitioning its crawler, dubbed (disgustingly) &quot;Slurp,&quot; to a new address at crawl.yahoo.net. Adjust your server logs as necessary and join the curmudgeons who are unimpressed. <br />
<span id="more-38284"></span> <br />
Too little (or too much, by some complaints), too late, it would seem. </p>
<p>The <a title="Yahoo Slurps" href="http://www.ysearchblog.com/archives/000460.html">Yahoo Search Blog</a> reads: </p>
<blockquote><p><em>&#8230;all machines crawling as Slurp are now in crawl.yahoo.net. You can see this change in your web server logs, where the page accesses from inktomisearch.com are being fully replaced by crawl.yahoo.net contacts. Note that this does not cover other Yahoo! crawlers, such as Yahoo! China, and other verticals, like Yahoo! Shopping, Yahoo! Travel, etc., which have their own user-agent.</em></p>
<p><em>Don&#8217;t fret though; there is no need to change your robots.txt file because the crawler user-agent is still Yahoo! Slurp. If you use IP based filtering, there is no need to change that either, since the IP addresses from which we crawl remain the same. However, please ensure that your network or firewall setup does not keep crawl.yahoo.net out as we won&#8217;t be able to include your content in our results.</em></p>
</blockquote>
<p>Follow that link for the full details. </p>
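<p>For anyone adjusting log reports after the move, a minimal sketch of sorting resolved hostnames by Slurp domain might look like the following. The function name and the sample hostnames are hypothetical, and a suffix check like this is no substitute for a proper reverse and forward DNS lookup.</p>

```python
def slurp_domain(hostname):
    """Return which Slurp domain a resolved hostname belongs to, or None."""
    host = hostname.strip().rstrip(".").lower()
    # New domain first, then the older one the blog post says is being replaced.
    for domain in ("crawl.yahoo.net", "inktomisearch.com"):
        if host == domain or host.endswith("." + domain):
            return domain
    return None


# Hypothetical hostnames as they might appear in a log with reverse DNS enabled.
samples = [
    "host1.crawl.yahoo.net",
    "host2.inktomisearch.com",
    "crawl.yahoo.net.example.org",  # look-alike that should not match
]
results = [slurp_domain(name) for name in samples]
print(results)
```

<p>Matching on a leading dot plus the domain keeps look-alike hosts such as the third sample from being counted as Slurp traffic.</p>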
<p>Over at <a title="WebmasterWorld" href="http://www.webmasterworld.com/yahoo_search/3359251.htm">WebmasterWorld</a>, the crowd is a bit mixed about it (but only a bit). The loudest complaint comes from &quot;IncrediBill,&quot; who argues that the move is not only a year and a half too late but has also gone overboard.</p>
<blockquote><p>
<em>Why do we need to allow an army of Yahoo spiders to redundantly abuse our servers? </em></p>
<p><em>Is it a conceptual problem that Yahoo can&#8217;t share pages already downloaded? </em></p>
<p><em>When I posed that question to one of their engineers I was given a lame excuse that the various crawlers had different needs&hellip;.</em></p>
<p><em>Funny, Google managed to make some of their crawlers share CACHE, so we know it can be done.</em></p>
</blockquote>
<p>Negativity often rings louder and truer than other things, but there is at least one voice in that forum who thinks Yahoo&#8217;s update is &quot;a small evolutionary improvement above&quot; Google.</p>
<p>Even if in our hearts, we know that&#8217;s not true. <img src='http://www.webpronews.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> <br />
&nbsp;
</p>
]]></content:encoded>
			<wfw:commentRss>http://www.webpronews.com/yahoo-slurps-somewhere-else-2007-06/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Webmaster Claims Spider Entered Contract In Suit</title>
		<link>http://www.webpronews.com/webmaster-claims-spider-entered-contract-in-suit-2007-03</link>
		<comments>http://www.webpronews.com/webmaster-claims-spider-entered-contract-in-suit-2007-03#comments</comments>
		<pubDate>Fri, 16 Mar 2007 22:37:34 +0000</pubDate>
		<dc:creator>WebProNews Staff</dc:creator>
				<category><![CDATA[Business]]></category>
		<category><![CDATA[Copyright]]></category>
		<category><![CDATA[Copyright law]]></category>
		<category><![CDATA[Eric Goldman]]></category>
		<category><![CDATA[Spider]]></category>
		<category><![CDATA[Suzanne Shell]]></category>
		<category><![CDATA[Wayback Machine]]></category>
		<category><![CDATA[Web Crawlers]]></category>
		<category><![CDATA[Webmaster]]></category>

		<guid isPermaLink="false">http://www.webpronews.com/?p=36221</guid>
		<description><![CDATA[<p>The Web and artificial intelligence have brought about some surreal, science-fiction-like questions. The most recent mind-bending concept is whether or not robots can enter into contracts &#8211; that is, is a Web crawler implicitly entering a contract posted on a website announcing copyright conditions? <br />
]]></description>
			<content:encoded><![CDATA[<p>The Web and artificial intelligence have brought about some surreal, science-fiction-like questions. The most recent mind-bending concept is whether or not robots can enter into contracts &ndash; that is, is a Web crawler implicitly entering a contract posted on a website announcing copyright conditions? </p>
<p>A little while back, we explored the idea that RSS, as an automatic distribution agent, could <a href="http://www.webpronews.com/insiderreports/2006/11/03/does-rss-imply-permission-to-reuse-content">imply permission</a> to republish. But that involves two human parties, essentially, with a technical agent in between. </p>
<p>A court battle in Colorado, however, focuses on claims brought by Suzanne Shell against the Internet Archive&#8217;s Wayback Machine, which holds in searchable perpetuity pages that appear on the Web, for future historical reference. </p>
<p>Shell owns the website www.profane-justice.org, devoted to providing information and support for people who feel they&#8217;ve been unlawfully targeted by state agents (like police or child services organizations) or unfairly accused of child abuse.</p>
<p>A notice appears on the site stating that users copying or distributing the content on the site automatically agree to the terms of a contract. Failure to abide carries a fee of $5,000 per page copied, $250,000 per occurrence of unauthorized use, and $50,000 for each failure to pay, plus costs and triple damages. </p>
<p>There was no mechanism in place on the site (such as in the <a href="http://www.webpronews.com/topnews/2007/02/26/controlling-how-your-site-is-indexed">robots.txt file</a>) to prevent Internet Archive&#8217;s robot from scanning, copying, and storing the pages on Shell&#8217;s website. She discovered that the Wayback Machine had reproduced the contents of her website about 87 times in five years, &quot;and displayed her entire website to the public daily during that period.&quot;&nbsp; </p>
<p>Shell sued the company for conversion, civil theft, breach of contract, and violations of the Racketeer Influenced and Corrupt Organizations Act (RICO) and the Colorado Organized Crime Control Act (COCCA). </p>
<p>As might be guessed, most of these claims were dismissed. But the breach of contract claim is still under consideration, awaiting more information. The question that will be decided, ultimately, is whether a web crawler that is not blocked by a website can legally be bound by a contract posted on the site. </p>
<p>The outcome of that question could also have an important impact on Web crawling and Internet copyright law itself. Search engines like Google have leaned on Fair Use principles when scouring the Web (and off-line libraries) for information. Defenders of the Google Book Search project have claimed that publishers can &quot;<a href="http://www.webpronews.com/topnews/2007/02/26/controlling-how-your-site-is-indexed">opt out</a>&quot; of their network by letting Google know their desire to do so. </p>
<p>Web crawling works in a similar vein. Webmasters must opt out of indexing via the robots.txt file, which prevents the spider from crawling the site. If the courts continue to back the right of search engines and other Internet companies to copy and index at will, then the whole copyright system, it would seem, goes opt-out by default, too. </p>
<p>Copyright law blogger John Ottaviani, on <a href="http://blog.ericgoldman.org/archives/2007/03/can_a_spider_en.htm">Eric Goldman</a>&#8217;s blog, goes into the <a href="http://blog.ericgoldman.org/archives/waybackshell.pdf">Internet Archive v. Shell</a> case and its implications in greater detail.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.webpronews.com/webmaster-claims-spider-entered-contract-in-suit-2007-03/feed</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
	</channel>
</rss>

