<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>WebProNews &#187; Crawlers</title>
	<atom:link href="http://www.webpronews.com/tag/crawlers/feed" rel="self" type="application/rss+xml" />
	<link>http://www.webpronews.com</link>
	<description>Breaking News in Tech, Search, Social, &#38; Business</description>
	<lastBuildDate>Mon, 13 Feb 2012 04:32:37 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Google News Now Using Googlebot for Crawling</title>
		<link>http://www.webpronews.com/google-news-googlebot-2011-08</link>
		<comments>http://www.webpronews.com/google-news-googlebot-2011-08#comments</comments>
		<pubDate>Thu, 25 Aug 2011 21:30:29 +0000</pubDate>
		<dc:creator>Chris Crum</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[Crawlers]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[google news]]></category>
		<category><![CDATA[Robots]]></category>

		<guid isPermaLink="false">http://www.webpronews.com/?p=74466</guid>
		<description><![CDATA[Google announced today that it will no longer be using a separate crawler for Google News, and will now start using Googlebot. &#8220;Google News recently updated our infrastructure to crawl with Google’s primary user-agent, Googlebot. What does this mean? Very little &#8230;]]></description>
			<content:encoded><![CDATA[<p>Google announced today that it will no longer be using a separate crawler for Google News, and will now start using Googlebot. </p>
<p>&#8220;Google News recently updated our infrastructure to crawl with Google’s primary user-agent, Googlebot. What does this mean? Very little to most publishers,&#8221; <a href="http://googlewebmastercentral.blogspot.com/2011/08/google-news-now-crawling-with-googlebot.html">says</a> Google News Product Specialist David Smydra. &#8220;Any news organizations that wish to opt out of Google News can continue to do so: Google News will still respect the robots.txt entry for Googlebot-News, our former user-agent, if it is more restrictive than the robots.txt entry for Googlebot.&#8221;</p>
<p>&#8220;Although you’ll now only see the Googlebot user-agent in your site’s logs, no need to worry: the appearance of Googlebot instead of Googlebot-News is independent of our inclusion policies,&#8221; says Smydra. &#8220;You can always check whether your site is included in Google News by searching with the “site:” operator. For instance, enter “site:yournewssite.com” in the search field for Google News, and if you see results then we are currently indexing your news site.&#8221;</p>
<p>As for analytics, you&#8217;ll still be able to differentiate traffic from Google Search and traffic from Google News, Google says. </p>
<p>Sites using Google&#8217;s metered subscription model or the first click free model won&#8217;t have to make any changes. For sites that require registration, payment or login before the full article can be read, however, Google News will only be able to crawl and index the title and snippet shown on the page. </p>
<p>Google stresses that the change will not affect how it crawls your News sitemaps. </p>
<p>More info in the <a href="http://www.google.com/support/news_pub/">help center</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.webpronews.com/google-news-googlebot-2011-08/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Google States Case for Online News in WSJ</title>
		<link>http://www.webpronews.com/google-does-more-to-appease-disgruntled-news-publishers-2009-12</link>
		<comments>http://www.webpronews.com/google-does-more-to-appease-disgruntled-news-publishers-2009-12#comments</comments>
		<pubDate>Thu, 03 Dec 2009 18:33:00 +0000</pubDate>
		<dc:creator>Chris Crum</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[Crawlers]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[google news]]></category>
		<category><![CDATA[News]]></category>
		<category><![CDATA[news search]]></category>
		<category><![CDATA[Online News]]></category>
		<category><![CDATA[Robots.txt]]></category>
		<category><![CDATA[Web Crawlers]]></category>

		<guid isPermaLink="false">http://www.webpronews.com/?p=52281</guid>
		<description><![CDATA[<p><strong>Update:&#160;</strong>The Wall Street Journal is <a href="http://online.wsj.com/article/SB10001424052748704107104574569570797550520.html">running a piece</a> from Google&#160;CEO&#160;Eric Schmidt on how Google can help newspapers. It's an interesting read. <br />
]]></description>
			<content:encoded><![CDATA[<p><strong>Update:&nbsp;</strong>The Wall Street Journal is <a href="http://online.wsj.com/article/SB10001424052748704107104574569570797550520.html">running a piece</a> from Google&nbsp;CEO&nbsp;Eric Schmidt on how Google can help newspapers. It&#8217;s an interesting read. </p>
<p><strong>Original Article:&nbsp;</strong>Google has created a new web crawler specifically for Google News. What this means is that publishers who do not want Google News to index their content can more easily control that. That also applies to publishers who don&#8217;t wish to completely cut out indexing, but wish to limit/restrict certain elements of their content from being indexed. </p>
<p>Google offers this new crawler at a time when Google&#8217;s relationship with online news is a heavy focus of discussion throughout the industry, with the <a href="http://www.webpronews.com/topnews/2009/12/01/minds-of-the-media-gather-to-discuss-future-of-news">FTC&#8217;s meeting of the media minds</a> taking place. This week <a href="http://www.webpronews.com/topnews/2009/12/01/google-changes-how-it-handles-paid-content">Google already announced some changes</a> to how it handles paid content (by offering a five-article limit for the &quot;first click free&quot; plan). Now the company appears to be further extending its olive branch to concerned publishers (whether or not that will be enough is another discussion). </p>
<p>In the past, publishers have been able to block Google from content via robots.txt and the Robots Exclusion Protocol (REP). They have also been able to keep content out of Google News while staying in Google Search, by using a contact form provided by Google. Now, Google is making it so publishers don&#8217;t even have to contact the company. </p>
<p><img align="right" style="margin: 10px;" title="Josh Cohen" alt="Josh Cohen" src="http://images.ientrymail.com/webpronews/article_pics/josh-cohen.jpg" />&quot;Now, with the news-specific crawler, if a publisher wants to opt out of Google News, they don&#8217;t even have to contact us &#8211; they can put instructions just for user-agent Googlebot-News in the same robots.txt file they have today,&quot; <a href="http://googlenewsblog.blogspot.com/2009/12/same-protocol-more-options-for-news.html">says</a> Google News Senior Business Product Manager Josh Cohen. &quot;In addition, once this change is fully in place, it will allow publishers to do more than just allow/disallow access to Google News. They&#8217;ll also be able to apply the full range of REP directives just to Google News. Want to block images from Google News, but not from Web Search? Go ahead. Want to include snippets in Google News, but not in Web Search? Feel free. All this will soon be possible with the same standard protocol that is REP.&quot;</p>
<p>&quot;While this means even more control for publishers, the effect of opting out of News is the same as it&#8217;s always been,&quot; says Cohen. &quot;It means that content won&#8217;t be in Google News or in the parts of Google that are powered by the News index. For example, if a publisher opts out of Google News, but stays in Web Search, their content will still show up as natural web search results, but they won&#8217;t appear in the block of news results that sometimes shows up in Web Search, called Universal search, since those come from the Google News index.&quot;</p>
<p>Cohen says Google News users shouldn&#8217;t notice any difference in their experience with the service. It will be interesting to see the reaction from disgruntled publishers, and whether or not this will make any significant difference in how they view Google News. </p>
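<p>As a rough illustration of the opt-out Cohen describes, the sketch below feeds a robots.txt containing a Googlebot-News group to Python&#8217;s standard <code>urllib.robotparser</code> module. The file contents, the example path and the choice of parser are assumptions made for illustration; Google&#8217;s own crawlers may resolve groups differently.</p>

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of Google News only, stay open to everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot-News
Disallow: /

User-agent: *
Disallow:
"""


def can_crawl(agent, path):
    """Return True if the given user-agent may fetch the path under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


# The example path is made up; any path on the site would give the same answers.
for agent in ("Googlebot-News", "Googlebot"):
    print(agent, can_crawl(agent, "/2009/12/story.html"))
```

<p>Under these assumptions the news-specific group shuts out Googlebot-News entirely, while Googlebot falls through to the catch-all group and stays allowed.</p>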
<p>
<strong>Related Articles:</strong></p>
<p><span style="font-family: Arial;"><span style="font-size: larger;">&gt;&nbsp;</span></span><a style="color: rgb(0, 105, 210); text-decoration: underline;" href="http://www.webpronews.com/topnews/2009/12/01/google-changes-how-it-handles-paid-content"><span style="font-family: Arial;"><span style="font-size: larger;">Google Changes How it Handles Paid Content</span></span></a></p>
<p><span style="font-family: Arial;"><span style="font-size: larger;">&gt; </span></span><a style="color: rgb(0, 105, 210); text-decoration: underline;" href="http://www.webpronews.com/topnews/2009/12/01/minds-of-the-media-gather-to-discuss-future-of-news"><span style="font-family: Arial;"><span style="font-size: larger;">Minds of the Media Gather to Discuss Future of News</span></span></a></p>
<p><span style="font-family: Arial;"><span style="font-size: larger;">&gt; </span></span><a style="color: rgb(0, 105, 210); text-decoration: underline;" href="../../../../../../topnews/2009/11/09/google-okay-with-blocking-news-corp"><span style="font-family: Arial;"><span style="font-size: larger;">Google Okay With Blocking News Corp.</span></span></a></p>
<p><span style="font-family: Arial;"><span style="font-size: larger;">&gt; </span></span><a style="color: rgb(0, 105, 210); text-decoration: underline;" href="../../../../../../topnews/2009/11/24/is-the-murdock-bing-deal-really-just-about-the-wall-street-journal"><span style="font-family: Arial;"><span style="font-size: larger;">Is it Really Crazy to Block Google?</span></span></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.webpronews.com/google-does-more-to-appease-disgruntled-news-publishers-2009-12/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Google reminds webmasters about robot invasion</title>
		<link>http://www.webpronews.com/google-reminds-webmasters-about-robot-invasion-2008-03</link>
		<comments>http://www.webpronews.com/google-reminds-webmasters-about-robot-invasion-2008-03#comments</comments>
		<pubDate>Fri, 28 Mar 2008 16:37:38 +0000</pubDate>
		<dc:creator>WebProNews Staff</dc:creator>
				<category><![CDATA[Crawlers]]></category>
		<category><![CDATA[robots.txt]]></category>
		<category><![CDATA[search engines]]></category>
		<category><![CDATA[Webmasters]]></category>

		<guid isPermaLink="false">http://blogs.webpronews.com/2008/03/28/google-reminds-webmasters-about-robot-invasion/</guid>
		<description><![CDATA[Users of Google&#8217;s Webmaster Central tools have access to an effective module for creating robots.txt files for their sites. Robots.txt holds the key for site publishers to gain proper indexing of their content by search engines. Scrupulous crawlers obey the &#8230;]]></description>
			<content:encoded><![CDATA[<p>Users of Google&#8217;s Webmaster Central tools have access to an effective module for creating robots.txt files for their sites.</p>
<p><span id="more-66832"></span></p>
<p>Robots.txt holds the key for site publishers to gain proper indexing of their content by search engines. Scrupulous crawlers obey the tenets of robots.txt, spidering what the file allows them to do and avoiding paths disallowed by the webmaster.</p>
<p><a href="http://googlewebmastercentral.blogspot.com/2008/03/speaking-language-of-robots.html">Google posted a note</a> about their robots.txt generator, housed in their Webmaster Tools. It permits the creation of blanket robots.txt files, or ones with more granular designations about robots and where certain ones can and cannot go.</p>
<p>Of course, a knowledge of robots.txt syntax and a few minutes in vi do the same thing, but there may be webmasters who prefer the comfort of a clean graphical interface to a text editor.</p>
<p>Google also has the advantage of a robots.txt analyzer in Webmaster Tools. This allows site publishers to test out the file and see if any of its contents could be problematic for arriving spiders.</p>
<p>They also noted a couple of caveats about robots.txt. First, not every search engine supports all of the possible extensions to the robots.txt standard. Second, there are unscrupulous crawlers that will ignore the file and grab whatever they can. Sensitive content should be password protected if it needs to be online.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.webpronews.com/google-reminds-webmasters-about-robot-invasion-2008-03/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Publishers Push ACAP As Robots.txt Improvement</title>
		<link>http://www.webpronews.com/publishers-push-acap-as-robots-txt-improvement-2007-11</link>
		<comments>http://www.webpronews.com/publishers-push-acap-as-robots-txt-improvement-2007-11#comments</comments>
		<pubDate>Thu, 29 Nov 2007 18:46:15 +0000</pubDate>
		<dc:creator>WebProNews Staff</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[ACAP]]></category>
		<category><![CDATA[Crawlers]]></category>
		<category><![CDATA[Publisher]]></category>
		<category><![CDATA[Publishers]]></category>
		<category><![CDATA[Robots.txt]]></category>
		<category><![CDATA[Search Engine]]></category>

		<guid isPermaLink="false">http://www.webpronews.com/?p=42254</guid>
		<description><![CDATA[The Automated Content Access Protocol (ACAP) debuted today as a set of improvements to deficiencies seen in the robots.txt protocol currently observed by search crawlers.
]]></description>
			<content:encoded><![CDATA[<p>The Automated Content Access Protocol (ACAP) debuted today as a set of improvements to deficiencies seen in the robots.txt protocol currently observed by search crawlers.<br />
<span id="more-42254"></span><br />
Google&#8217;s courtroom <i>bete noire</i> Agence France-Presse, and business publishers Reed Elsevier and John Wiley &#038; Sons, are among those who developed <a href="http://www.the-acap.org/">ACAP</a>. The need to control content while making it available to search engine users drove the development of this protocol, which is complementary to the Robots Exclusion Protocol found in robots.txt files.</p>
<p>
&#8220;ACAP will give content owners the confidence to allow search engines to index their content under clear terms of use,&#8221; the organization behind ACAP&#8217;s development said in their FAQ. </p>
<p>
&#8220;To date, many aggregation websites have chosen to adopt a liberal attitude to copyright &#8211; &#8216;it&#8217;s OK until someone tells us it isn&#8217;t&#8217; &#8211; which means there is an enormous amount of infringing material being hosted by major companies,&#8221; said the group. </p>
<p>
That &#8220;liberal attitude&#8221; has been Google&#8217;s regular stance on not just online content, but also the content in its book search project. Google&#8217;s book scanning has angered publishers due to the search giant&#8217;s position that publishers need to actively opt out, rather than Google proactively seeking permission to index.</p>
<p>
Publishers consider robots.txt too simplistic with its allow/disallow choices for spiders and directories or types of content. &#8220;These simple choices are inconsistently interpreted,&#8221; ACAP claimed; search engines will likely find that opinion a surprising one.</p>
<p>
ACAP extends what robots.txt presents to crawlers, in a standard form. For example, a time limit value can be defined, telling the crawler the publisher wants certain content to expire and be removed from an index after a given date or period of time.</p>
<p>
From a cursory review of the technical documentation, ACAP looks like a way for publishers to establish usage guidelines that they can utilize in lawsuits against search engines. Though publishing groups have backed the standard, major search engines like Google and Yahoo are not represented in ACAP&#8217;s supporters.</p>
<p>
This could be a step toward gearing up for a fight with the search engines to force them to comply with ACAP as they crawl available online content. We won&#8217;t be surprised if the search engines respond by disallowing the spidering of any ACAP-enhanced sites until everyone reaches a public common ground where they accept ACAP as an extension of the robots.txt standard.</p>
<p>
<a href="http://twitter.com/dutter/">follow me on Twitter</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.webpronews.com/publishers-push-acap-as-robots-txt-improvement-2007-11/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SES &#8211; Meet The Crawlers</title>
		<link>http://www.webpronews.com/ses-meet-the-crawlers-2007-08</link>
		<comments>http://www.webpronews.com/ses-meet-the-crawlers-2007-08#comments</comments>
		<pubDate>Fri, 24 Aug 2007 18:57:09 +0000</pubDate>
		<dc:creator>Navneet Kaushal</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[Crawlers]]></category>
		<category><![CDATA[SES]]></category>

		<guid isPermaLink="false">http://www.webpronews.com/?p=40012</guid>
		<description><![CDATA[<p>Representatives from major crawler-based search engines cover how to submit and feed them content, with plenty of Q&#38;A time to cover issues related to ranking well and being indexed.</p>
<!--sessj07-->]]></description>
			<content:encoded><![CDATA[<p>Representatives from major crawler-based search engines cover how to submit and feed them content, with plenty of Q&amp;A time to cover issues related to ranking well and being indexed.</p>
<p><!--sessj07--><span id="more-40012"></span></p>
<p>Moderator:</p>
<ul>
<li>Danny Sullivan, Conference Co-Chair, Search Engine Strategies San Jose</li>
</ul>
<p>Speakers:</p>
<ul>
<li>Peter Linsley, Sr. Product Manager, Ask.com</li>
<li>Evan Roseman, Software Engineer, Google, Inc.</li>
<li>Sean Suchter, Yahoo! Search</li>
<li>Eytan Seidman, Microsoft</li>
</ul>
<p>First to speak is <strong>Eytan Seidman</strong> from Microsoft. He gives a presentation on Microsoft&#8217;s Live Webmaster Portal, which explains how Microsoft&#8217;s crawler will index your site. Live Webmaster Portal supports sitemap submissions and also lets you view your website&#8217;s statistics. Microsoft has many search engine crawlers, and all their names begin with &quot;MSNBot&quot;:</p>
<ul>
<li>web search</li>
<li>news</li>
<li>academic</li>
<li>multimedia</li>
<li>user agent</li>
</ul>
<p>Microsoft also supports &quot;NOODP&quot; and &quot;NOCACHE&quot; tags.</p>
<p>Next is Yahoo! Search&#8217;s <strong>Sean Suchter</strong>, who also has a presentation about Yahoo!&#8217;s crawler.</p>
<p>He covers dynamic URL rewriting via Site Explorer and the &quot;robots-nocontent&quot; tag. Yahoo! has also made crawler load improvements (reduction and targeting): the new Yahoo! search engine crawler targets better and crawls at a comparatively low volume.</p>
<p>Google&#8217;s Evan Roseman steps up to explain and discuss Webmaster Central&#8217;s features. He recommends taking advantage of Webmaster Central&#8217;s submit-a-site option so that Google&#8217;s search engine crawler can index all your content.</p>
<p>Next up is Ask.com&#8217;s Peter Linsley, who discusses catering to the search engine robot, since the robot is often forgotten when a site caters to the human visitor. Some problems include requiring cookies. He points out that Ask does accept sitemap submissions but would rather be able to crawl naturally.</p>
<p>Peter uses the Adobe site to demonstrate some issues that engines may have with multiple domains and duplicate content. He then uses the Mormon.org site and shows that it disallows crawlers from indexing the root page. This creates problems with crawling.</p>
<p><strong>Q &amp; A</strong></p>
<ul>
<li>Q: The first question is for the Google rep: will Google allow users to see supplemental results within Webmaster Central now that it is no longer tagging them in search results?</li>
<li>A: Evan stated that being in supplemental is not a penalty, but did not give a definite answer as to whether users would be able to discover whether or not results are supplemental.
<p>Danny interjects that all engines have a two-tier system, and Eytan, Sean and Peter confirm that. So&hellip; they all have supplemental indices, but people only seem to be concerned with Google&#8217;s, most likely because Google used to identify them as such in the regular search results.</p>
</li>
<li>Q: What, if anything, can a competitor actually do to hurt your site?</li>
<li>A: Evan says it is possible for a competitor to hurt your site, but that it is extremely difficult. Hacking and domain hijacking are some of the things that can occur.</li>
<li>Q: The question relates to a scenario in which you re-publish content to places such as eBay, but the sites you re-publish to rank better than the original. How can a webmaster identify the original source of the information?</li>
<li>A: Peter answers that one could try to get the places that republish the content to use robots.txt to block spidering of it. Another option is to have them link back to the original site. However, on a site such as eBay, that is not always possible. In that case, the answer is to create unique content for the sites you are re-publishing on.</li>
<li>Q: Robert Carlton asks whether all engines are moving towards having things like Webmaster Central. He also asks how they treat 404s and 410s.</li>
<li>A: As for 404s and 410s, Ask, Google and Yahoo! treat them the same. Robert points out that they should treat them differently, as a 410 indicates the file is gone whereas a 404 is an error.</li>
<li>Q: A question regarding getting content crawled more frequently.</li>
<li>A: Evan suggests using the Site Map feature in Webmaster Central and keeping it up to date. He also suggests promoting the content by placing a link to it on the site&#8217;s home page.</li>
<li>Q: How can one use sitemaps more effectively for very large sites whose information changes on a regular basis? The questioner also asks how to get more pages indexed when only a portion are being indexed.</li>
<li>A: Submitting a sitemap to Google is not going to cause other URLs to not be crawled. Evan also points out that Google is not going to be able to crawl and include ALL the pages that are out there. He again suggests that webmasters promote their pages, for example by listing them on the home page. However, when dealing with hundreds of thousands of pages, that is not always feasible.</li>
<li>Q: How do engines interpret things like AJAX, JavaScript, etc.?</li>
<li>A: Eytan answers that if webmasters want things interpreted, they are going to have to represent them in a format the engine can understand, and AJAX and JavaScript are currently not among those formats.</li>
<li>Q: A question regarding rankings in Yahoo! disappearing for three weeks and then coming back. Is this due to an update?</li>
<li>A: Sean answers that it certainly could be, and suggests using Site Explorer to see if there is some kind of issue.</li>
<li>Q: How many links will engines actually crawl per page? How much is too much?</li>
<li>A: Peter says there is no hard and fast rule, but to keep the end user in mind. Evan echoes the same feeling.</li>
<li>Q: Do the engines use meta descriptions?</li>
<li>A: All engines read them and may use them if the algorithm judges them relevant.</li>
<li>Q: For sites that are designed completely in Flash, can you use content in a &quot;noscript&quot; tag, or would that be considered some type of cloaking?</li>
<li>A: Sean says IP delivery is a no-no, but if the content is the same as in the Flash, he&#8217;d rather see content in noscript than traditional cloaking. Evan suggests avoiding sites built completely in Flash and using Flash components instead.</li>
<li>Q: Is the meta keywords tag still relevant?</li>
<li>A: Microsoft &#8211; no, Yahoo! &#8211; not really, Google &#8211; not really, and Ask &#8211; not really. All read it, but it has very little bearing. For a really obscure keyword that appears only in the keywords tag and nowhere else on the web, Yahoo! and Ask are the only ones that will show a search result based on it.</li>
<li>Q: How do engines view automated submission/ranking software?</li>
<li>A: Evan &#8211; don&#8217;t use them.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.webpronews.com/ses-meet-the-crawlers-2007-08/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Truth About Web Crawlers</title>
		<link>http://www.webpronews.com/truth-about-web-crawlers-2006-04</link>
		<comments>http://www.webpronews.com/truth-about-web-crawlers-2006-04#comments</comments>
		<pubDate>Thu, 20 Apr 2006 13:37:23 +0000</pubDate>
		<dc:creator>Maksym Nesen</dc:creator>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[Crawlers]]></category>
		<category><![CDATA[Web]]></category>

		<guid isPermaLink="false">http://www.webpronews.com/?p=28651</guid>
		<description><![CDATA[Wouldn't it be nice to be able to leave some code in your web site to tell the search engine spider crawlers to make your site number one?
]]></description>
			<content:encoded><![CDATA[<p>Wouldn&#8217;t it be nice to be able to leave some code in your web site to tell the search engine spider crawlers to make your site number one?</p>
<p>Unfortunately, a robots.txt file or robots meta tag won&#8217;t do that, but they can help the crawlers index your site better and block out the unwanted ones. First, a few definitions:</p>
<p>Search Engine Spiders or Crawlers &#8211; A web crawler (also known as web spider) is a program which browses the World Wide Web in a methodical, automated manner. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, that will index the downloaded pages to provide fast searches.</p>
<p>A web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit. As it visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, recursively browsing the Web according to a set of policies.</p>
<p>Robots.txt &#8211; The robots exclusion standard or robots.txt protocol is a convention to prevent well-behaved web spiders and other web robots from accessing all or part of a website. The information specifying the parts that should not be accessed is specified in a file called robots.txt in the top-level directory of the website.</p>
<p>The robots.txt protocol is purely advisory, and relies on the cooperation of the web robot, so that marking an area of your site out of bounds with robots.txt does not guarantee privacy. Many web site administrators have been caught out trying to use the robots file to make private parts of a website invisible to the rest of the world. However the file is necessarily publicly available and is easily checked by anyone with a web browser.</p>
<p>The robots.txt patterns are matched by simple prefix comparisons against the start of the path, so care should be taken to make sure that patterns matching directories have the final &#8216;/&#8217; character appended: otherwise all files with names starting with that substring will match, rather than just those in the directory intended.</p>
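<p>The trailing-slash pitfall can be checked with a short sketch. It assumes Python&#8217;s standard <code>urllib.robotparser</code>, which matches a Disallow rule against the beginning of the requested path; the <code>/private</code> directory, the file names and the <code>ExampleBot</code> agent are made up for illustration, and other parsers may behave differently.</p>

```python
from urllib.robotparser import RobotFileParser


def allowed(rules, path, agent="ExampleBot"):
    """Parse robots.txt text and report whether the agent may fetch the path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


# Two hypothetical rule sets that differ only in the trailing slash.
WITHOUT_SLASH = "User-agent: *\nDisallow: /private"
WITH_SLASH = "User-agent: *\nDisallow: /private/"

# A file that merely shares the prefix, and a file inside the directory.
for rules in (WITHOUT_SLASH, WITH_SLASH):
    for path in ("/private-notes.html", "/private/report.html"):
        print(repr(rules.splitlines()[1]), path, allowed(rules, path))
```

<p>With this parser, the rule without the slash also blocks <code>/private-notes.html</code>, while the rule with the slash blocks only paths inside the directory.</p>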
<p>Meta Tag &#8211; Meta tags are used to provide structured data about data.</p>
<p>In the early 2000s, search engines veered away from reliance on Meta tags, as many web sites used inappropriate keywords, or were keyword stuffing to obtain any and all traffic possible.</p>
<p>Some search engines, however, still take Meta tags into some consideration when delivering results. In recent years, search engines have become smarter, penalizing websites that are cheating (by repeating the same keyword several times to get a boost in the search ranking). Instead of going up rankings, these websites will go down in rankings or, on some search engines, will be kicked off of the search engine completely.</p>
<p>Index a site &#8211; The act of crawling your site and gathering information.</p>
<p>How can the robots.txt file and meta tag help you?</p>
<p>In the robots.txt file you can tell the harmful &#8216;web crawlers&#8217; to leave your web site alone, and give helpful hints to the ones you want to crawl your site. Below is an example of how to disallow a web crawler from crawling your site:</p>
<p><code># this identifies the wayback machine<br />
User-agent: ia_archiver<br />
Disallow: /</code></p>
<p>ia_archiver is the crawler name for the wayback machine that you may have heard of, and the / after disallow tells ia_archiver not to index any of your site. The # allows you to write comments to yourself so you can keep track of what you typed.</p>
<p>Type the above three lines into Notepad on your computer and save the file to the root directory of your web site as robots.txt. Web crawlers look for this file first at a web site before doing anything else. This helps the crawler do its job, and lets the web site owner tell the spider what to do. Say, for instance, you have some data that you don&#8217;t want the crawlers to see (like duplicate content for other browser referrer pages).</p>
<p>You can deter crawlers from indexing the &#8216;duplicate&#8217; directory by typing this into your robots.txt file. </p>
<p><code>User-agent: *<br />
Disallow: /duplicate/</code></p>
<p>The * after User-agent says that this rule applies to all crawlers, and /duplicate/ after Disallow tells all crawlers to ignore this directory and not crawl it. Each group of User-agent and Disallow lines must be separated from the next group by a blank line in order for the file to function correctly. So this is how you would combine the above two commands into one robots.txt file:</p>
<p><code># this identifies the wayback machine<br />
User-agent: ia_archiver<br />
Disallow: /<br />
<br />
User-agent: *<br />
Disallow: /duplicate/</code></p>
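<p>One way to sanity-check a combined file like this before uploading it is to run it through a parser. The sketch below assumes Python&#8217;s standard <code>urllib.robotparser</code>; the <code>ExampleBot</code> agent and the page paths are hypothetical, and real crawlers may interpret the file somewhat differently.</p>

```python
from urllib.robotparser import RobotFileParser

# The combined file from the article: one group per crawler, separated by a blank line.
COMBINED_ROBOTS_TXT = """\
# this identifies the wayback machine
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /duplicate/
"""

combined_parser = RobotFileParser()
combined_parser.parse(COMBINED_ROBOTS_TXT.splitlines())


def may_fetch(agent, path):
    """Return True if the combined rules let the agent fetch the path."""
    return combined_parser.can_fetch(agent, path)


# ExampleBot stands in for any crawler that has no group of its own.
for agent, path in [
    ("ia_archiver", "/index.html"),
    ("ExampleBot", "/index.html"),
    ("ExampleBot", "/duplicate/page.html"),
]:
    print(agent, path, may_fetch(agent, path))
```

<p>Under these assumptions ia_archiver is refused everywhere, while every other crawler is refused only inside <code>/duplicate/</code>.</p>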
<p>One thing to note that is very important: anyone can access the robots.txt file of a site. So if you have information that you don&#8217;t want anyone to see, don&#8217;t list it in the robots.txt file. If the directory that you don&#8217;t want anyone to see is not linked to from anywhere, the crawlers are unlikely to find it anyway.</p>
<p>An alternative way to block indexing is to put a meta tag into the page. It looks like this:</p>
<p><code>&lt;meta name=&quot;robots&quot; content=&quot;noindex,nofollow&quot;&gt;</code></p>
<p>You put this into the &lt;head&gt; section of your web page. This line tells the robot crawlers not to index the page and not to follow any of the hyperlinks on the page. As another example, <code>&lt;meta name=&quot;robots&quot; content=&quot;noindex,follow&quot;&gt;</code> tells the robot crawlers not to index the page, but to follow the hyperlinks on it.</p>
<p><b>Did You Know That Google Has Its Own Meta Tag?</b></p>
<p>It looks like this: <code>&lt;meta name=&quot;googlebot&quot; content=&quot;noindex,nofollow,noarchive&quot;&gt;</code>. This tells the Google robot crawler not to index the page, not to follow any of the links, and not to store cached versions of your web site. The no-cache part is useful if you update the content on your site frequently: it prevents the web user from seeing outdated content that is served from the cache instead of the refreshed page.</p>
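<p>To see what such tags tell a crawler, the sketch below reads them back out of a page with Python&#8217;s standard <code>html.parser</code>. It assumes the usual form of the tag, a <code>meta</code> element named <code>robots</code> or <code>googlebot</code> with a comma-separated <code>content</code> attribute; the sample page and the helper names are made up for illustration.</p>

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from robots and googlebot meta tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name in ("robots", "googlebot"):
            content = attributes.get("content") or ""
            # Split "noindex, follow" into a set of lowercase directives.
            self.directives[name] = {
                part.strip().lower() for part in content.split(",") if part.strip()
            }


def robots_directives(page):
    """Return a dict mapping the meta name to its set of directives."""
    parser = RobotsMetaParser()
    parser.feed(page)
    return parser.directives


# Hypothetical page: keep it out of the index, but let crawlers follow its links.
SAMPLE_PAGE = (
    '<html><head><meta name="robots" content="noindex,follow"></head>'
    "<body><p>Hello</p></body></html>"
)

print(robots_directives(SAMPLE_PAGE))
```

<p>A crawler-side check would then test for membership, for example whether <code>noindex</code> is in the set for <code>robots</code>, before deciding to index the page.</p>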
<p>You can use the meta tag to talk specifically to Google&#8217;s robots to avoid complications or if you are optimizing your site for Google&#8217;s search engine. Recommended software tool to automate submitting and link creation: &#8220;<a href="http://blog-submitter.cafe150.com" class="bluelink">http://blog-submitter.cafe150.com</a>&#8221; &#8211; Blogs AutoFiller</p>
<p>Maksym Nesen is the lead programmer of the Oksima team. He developed Blogs Auto Filler, a product that saves time and money for people who advertise on blogs.</p>
<p>http://blog-submitter.cafe150.com</p>
]]></content:encoded>
			<wfw:commentRss>http://www.webpronews.com/truth-about-web-crawlers-2006-04/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Robot Generation &#8211; Crawlers Prevail!</title>
		<link>http://www.webpronews.com/the-robot-generation-crawlers-prevail-2005-12</link>
		<comments>http://www.webpronews.com/the-robot-generation-crawlers-prevail-2005-12#comments</comments>
		<pubDate>Wed, 28 Dec 2005 14:50:53 +0000</pubDate>
		<dc:creator>Martin Lemieux</dc:creator>
				<category><![CDATA[Business]]></category>
		<category><![CDATA[Crawlers]]></category>

		<guid isPermaLink="false">http://www.webpronews.com/?p=25407</guid>
		<description><![CDATA[In the pursuit of online power, manipulation, and control, a new breed of web crawlers hiding in the dark alleys of the web is being brought forth.
]]></description>
			<content:encoded><![CDATA[<p>In the pursuit of online power, manipulation, and control, a new breed of web crawlers hiding in the dark alleys of the web is being brought forth.</p>
<p>A new breed of crawlers is coming to a site near you!</p>
<p>Many higher-caliber web development teams are spawning a breed of crawlers that seem to index your valuable and copyrighted content for their own use and not for the searchers themselves.</p>
<p>I&#8217;m talking about millions of pages online that are being generated in order to attract more search engines and visitors to their site.</p>
<p>Have you ever noticed a portion of your content listed on some obscure page that has 50 other companies listed with yours and is evenly stuffed with the same keywords?</p>
<p>Everywhere you search, you&#8217;re bound to come across one of these pages. Some would say that this is great for their business by being listed within these pages but I truly believe that this style of indexing hurts our industry as a whole.</p>
<p>This is nothing short of spam and keyword stuffing!</p>
<p>I mean, what are we teaching the newcomers to the search marketing industry? That this style of &#8220;FFA Pages&#8221; is acceptable, and that in order to attract thousands of visitors, you too should stuff a ton of pages with someone else&#8217;s content!</p>
<p>For a long time FFA pages (Free For All) were a thing of the past. But, in the midst of battle, a new breed of FFA has sprung up and taken the search marketing industry by storm.</p>
<p>I call this new breed &#8220;TFFA&#8221; &#8211; Targeted Free For All!</p>
<p>These TFFA pages can be found on some of the most prominent websites worldwide. It&#8217;s not just the little guys that indulge themselves and hide behind this form of search marketing spam.</p>
<p>Unfortunately, the industry is turning a blind eye!</p>
<p>Hey, if the big guys are doing it, then it&#8217;s ok!? NO! On the other hand, until these TFFA pages become more and more exposed as nothing short of spam tactics, we will continue to see our personal and valuable content being ripped right out from underneath our noses.</p>
<p>Martin Lemieux is the owner of Smartads. We help companies like yourself to market your business online and offline.</p>
<p>For more of Martin&#8217;s articles, go here: <a href="http://www.smartads.info/newsletter/archive">http://www.smartads.info/newsletter/archive</a></p>
<p>The Martin Report &#8211; eBusiness News!: <a href="http://www.smartads.info/the-martin-report/">http://www.smartads.info/the-martin-report/</a></p>
<p>SEM XML Feed: <a href="http://www.article99.com/rss-feeds/Search-Engine-Marketing.xml">http://www.article99.com/rss-feeds/Search-Engine-Marketing.xml</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.webpronews.com/the-robot-generation-crawlers-prevail-2005-12/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Break It Down: Search Engine Crawlers</title>
		<link>http://www.webpronews.com/break-it-down-search-engine-crawlers-2005-09</link>
		<comments>http://www.webpronews.com/break-it-down-search-engine-crawlers-2005-09#comments</comments>
		<pubDate>Wed, 21 Sep 2005 18:46:41 +0000</pubDate>
		<dc:creator>John Stith</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[Crawlers]]></category>
		<category><![CDATA[Engine]]></category>

		<guid isPermaLink="false">http://www.webpronews.com/?p=23242</guid>
		<description><![CDATA[How do they find those wonderful sites? A question that many interested in the internet search world often ponder is also answered fairly easily. It's all in the way they crawl.
]]></description>
			<content:encoded><![CDATA[<p>How do they find those wonderful sites? A question that many interested in the internet search world often ponder is also answered fairly easily. It&#8217;s all in the way they crawl.</p>
<p>Rob Sullivan put together a delightful article at <a href="http://www.searchenginejournal.com/index.php?p=2230">SearchEngineJournal</a> on the background and current life of search engine crawlers. He covers their wee beginnings at MIT up to the hordes crawling the sprawling internet we know and love today.</p>
<p>	He also tries to explain a few things regarding what people might see when they do searches based on their particular topic:</p>
<p>You may also notice, upon reviewing your reports, that crawlers like Googlebot will visit repeatedly and request the same page(s) repeatedly. This is common as crawlers also want to be sure the site is stable and also to measure the page&#8217;s change frequency.</p>
<p>	He goes on to explain how crawlers work, including certain behavior patterns, and even breaks them down somewhat by individual companies like Yahoo or AskJeeves. </p>
<p>	If you want some insight into how crawlers work, this is an excellent article.</p>
<p>John Stith is a staff writer for WebProNews covering technology and business. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.webpronews.com/break-it-down-search-engine-crawlers-2005-09/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Crawler Insights From Google and Yahoo!</title>
		<link>http://www.webpronews.com/crawler-insights-from-google-and-yahoo-2004-03</link>
		<comments>http://www.webpronews.com/crawler-insights-from-google-and-yahoo-2004-03#comments</comments>
		<pubDate>Mon, 08 Mar 2004 19:21:12 +0000</pubDate>
		<dc:creator>Garrett French</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[Cloaking]]></category>
		<category><![CDATA[Crawler]]></category>
		<category><![CDATA[Crawlers]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Web]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.webpronews.com/?p=9086</guid>
		<description><![CDATA[My favorite sessions at the SES conference were those where Google and Yahoo appeared on the same panels.  You could almost always count on some crackling tension between the two search giants.

The "Meet The Crawlers" session was no exception.
]]></description>
			<content:encoded><![CDATA[<p>My favorite sessions at the SES conference were those where Google and Yahoo appeared on the same panels.  You could almost always count on some crackling tension between the two search giants.</p>
<p>The &#8220;Meet The Crawlers&#8221; session was no exception.</p>
<p>Join the <a href="http://www.webproworld.com/viewtopic.php?p=80884">discussion on web crawlers</a> here.</p>
<table width="350" border="0" cellspacing="0" cellpadding="0">
<tr>
<td align="center"><img src="http://images.ientrymail.com/webpronews/crawl.jpg" class="irImage" alt="Weaving the Web" title="Weaving the Web"></td>
</tr>
<tr>
<td align="right" class="caption" style="padding-bottom: 10px; padding-left: 45px; padding-right: 45px;">Weaving the Web</td>
</tr>
<tr>
<td align="center" class="caption" style="padding-bottom: 0px;"><img src="http://images.ientrymail.com/webpronews/salon/complete.gif" width="334" height="21"></td>
</tr>
</table>
<p>Craig Neville-Manning, senior research scientist at Google, had some great advice for webmasters (and a pointed barb for Yahoo &#8211; we&#8217;ll get to that though).</p>
<p>&#8220;We don&#8217;t like pointing users to pages that you change,&#8221; said Craig, describing the practice of cloaking.  You should avoid cloaking at all costs.  As Tim Mayer of Yahoo pointed out during our lunch chat, there&#8217;s a <a href="http://www.webpronews.com/insiderreports/searchinsider/wpn-49-20040305YahooTalksBlackHatOptimizationAndSearchAsMedia.html">legitimate use</a> for every potential spam technique.</p>
<p>For example, you can use cloaking to show search engines an optimized page and then, say, a Flash intensive page when users arrive.  Despite cloaking&#8217;s legitimate uses though, authorities recommend that you don&#8217;t do it.</p>
<p>Concerned you might be cloaking?  Read this <a href="http://www.netdummy.com/netdummy-31-20030731PageCloakingToCloakorNottoCloak.html">page cloaking article</a>.</p>
<p>Craig revealed a good rule of thumb for optimizers &#8211; Google&#8217;s algorithm values text and links that your site visitors can see more highly than anything they can&#8217;t see.  This means focus your efforts more on explicit, helpful, and keyword-focused links, as well as copy that informs your visitors.</p>
<p>For those concerned that the Google bot uses too much bandwidth, he mentioned that it can detect when your server is slowing down and will back off.  Also, the Google bot follows the robots.txt file to the letter.</p>
<p>If you have content you don&#8217;t want the bot to find, be sure to put a robots.txt file up to keep it out.  The bot, says Craig, can find content that&#8217;s unlinked.  That&#8217;s right, the Google bot can find single pages dangling unlinked in space.  He didn&#8217;t explain how this happens.</p>
<p>The question and answer session revealed a bit of how Google looks at keywords in the url.  Someone from the audience asked how Google views words in the url, and whether you should hyphenate them for added relevancy and ranking.</p>
<p>Craig said that Google does index words from the url, but they don&#8217;t have as much weight as text links.  He added quickly though that you should not engineer your links for the algorithm &#8211; it&#8217;s better to have your url meaningful to your visitors than use it to affect your ranking.</p>
<p>Tim Mayer of Yahoo seconded this.  He said focusing too much on url engineering can get you into the realm of over-optimization.  &#8220;As a user,&#8221; said Tim, &#8220;if I see a domain with lots of hyphens it&#8217;s usually a low quality site.&#8221;  He advised that you not push your filenames too hard, and that you have intuitive directory structures.</p>
<p>In the Link Building Strategies session, prominent SEO guru <a href="http://www.webguerilla.com">Greg Boser</a> said &#8220;hyphenated domains have come and gone.&#8221;</p>
<p>The big talk at this conference was the new Yahoo paid inclusion program, which allows webmasters to pay to show up in Yahoo&#8217;s primary search results.  At the close of Google employee Craig&#8217;s presentation he declared, in a comment obviously leveled at fellow presenter Tim Mayer of Yahoo, &#8220;our search results are not for sale.&#8221;</p>
<p>Tim, ever the gentleman, let it slide.</p>
<p>Garrett French is the editor of iEntry&#8217;s eBusiness channel.  You can talk to him directly at <a href="http://www.webproworld.com">WebProWorld</a>, the eBusiness Community Forum. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.webpronews.com/crawler-insights-from-google-and-yahoo-2004-03/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

