Let the Spider Crawl
For most sites, one of the first things we check is whether the site is crawler friendly. “Crawler friendly” you say? What the heck does that mean?
Search engines find new sites and pages mostly by following links from sites they already know about. The software programs that search engines use to perform this task are often called “bots” or “spiders”. You get the analogy, right? “Web”, “Spiders”, “Crawl”.
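To make the link-following idea concrete, here is a minimal sketch in Python. It does not touch the network: the `LINKS` dictionary is a made-up site, and `crawl` is a hypothetical helper that simply walks links breadth-first from the home page. Real spiders are far more involved, but the basic consequence is the same: a page that nothing links to never gets discovered.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
LINKS = {
    "/": ["/about", "/products"],
    "/about": ["/"],
    "/products": ["/products/widget"],
    "/products/widget": ["/"],
    "/orphan": ["/"],  # no page links here, so a crawler never finds it
}

def crawl(start, links):
    """Follow links breadth-first and return pages in discovery order."""
    seen = [start]
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.append(target)
                queue.append(target)
    return seen

found = crawl("/", LINKS)
missed = sorted(set(LINKS) - set(found))
print("discovered:", found)
print("never discovered:", missed)
```

In this toy site, "/orphan" exists but is never reached, because the spider only learns about pages through links.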
If you don’t make sure your site is crawlable and indexed, then you’re putting your web site at a gross disadvantage. For example, if you have a 1,500 page web site and only 700 pages are getting indexed, that’s like showing up to a baseball game with only 5 of your players. You need the whole team to win, so make sure your site is crawlable and getting indexed properly.
As search engine spiders crawl the links of your site, they make copies of the pages and then perform other functions: stripping away the code, interpreting the remaining text, and running other analysis that ultimately leads to a score for the page and an association of the page with certain words. All of this, along with links into your site from other web sites, influences your rankings. According to the PPT slides from the recent Google Press Day, over 200 “signals” are used to rank web pages on Google.
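As a rough illustration of the “strip away the code” step, the sketch below uses Python’s standard `html.parser` module to pull the visible text out of a page. The `TextExtractor` class, the `page_text` helper, and the sample HTML are all assumptions made up for this example; this is not how any particular search engine does it.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text and ignores script and style contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def page_text(html):
    """Return the text a reader would see, with the markup removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

sample = (
    "<html><head><title>Blue Widgets</title>"
    "<style>p{color:red}</style></head>"
    "<body><h1>Blue Widgets</h1>"
    "<p>Hand-made <b>widgets</b> in blue.</p></body></html>"
)
text = page_text(sample)
print(text)
```

What is left after the tags are gone is the text that can be associated with words, which is why the words on the page matter so much.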
Here’s an animation of how Yahoo’s SLURP crawls a network of pages.
If a search engine has difficulty “crawling” the links within your site, then the pages either won’t get indexed at all or will only get partially indexed – neither of which will help your site’s rankings.
OK, now I know why, but what about the how? Search engine friendly URLs are simple. As in, short and simple. For example, the url of this web page is: http://www.toprankblog.com/2006/05/seo-tips-let-the-spider-crawl/
It could instead be something like http://www.toprankblog.com/?pageid=234234&articleid=5tips&postid=435345 or similar. The second URL is still crawlable, but if you got to pick, which one would you prefer to index? Which one would you be more likely to remember as a user?
Most crawling problems with links and the URLs they point to involve shopping cart software or content management systems that place a lot of extra information in the web page URL. If references to “?sid=” or a large number of variables are included in the URL, it can cause issues. Search engine bots are leery of “spider traps”, such as calendars or situations where an infinite number of URL versions display the exact same web page. This often occurs with the use of session IDs.
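Here is a small sketch of why session IDs cause trouble, using Python’s standard `urllib.parse`. The `canonical_url` helper, the `SESSION_PARAMS` names, and the example.com URLs are hypothetical, picked for illustration; real sites use all sorts of parameter names. The point is that several different-looking URLs can collapse to a single page once the session noise is removed.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed session parameter names; real systems vary.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}

def canonical_url(url):
    """Drop session parameters and sort the rest so variants compare equal."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    kept.sort()
    return urlunsplit((scheme, netloc.lower(), path, urlencode(kept), ""))

variants = [
    "http://www.example.com/cart?item=42&sid=abc123",
    "http://www.example.com/cart?sid=zzz999&item=42",
    "http://www.example.com/cart?item=42",
]
unique = {canonical_url(u) for u in variants}
print(len(variants), "URLs crawled,", len(unique), "actual page")
```

To a spider, every new session ID looks like a brand new URL, so it can waste its time re-crawling one page over and over.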
Simple and short URLs are typically the easiest to crawl, so try to use a content management system that produces short, clean URLs.
You can also use programs like Google Sitemaps to submit your site URLs for inclusion. There is no guarantee it will work, but it’s been pretty effective for many web sites. Google Sitemaps works in conjunction with a normal “crawl” of your web site. Plus there are many useful troubleshooting features and information available with Google Sitemaps. You can also submit an RSS feed or plain text file of your site’s URLs to Yahoo.
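For a sense of what a URL submission file can look like, here is a minimal sketch that builds both an XML sitemap and a plain text URL list from the same set of pages. It assumes the sitemaps.org 0.9 XML format, which may differ from what a given engine accepted when this was written, so check the engine’s own documentation for the exact format. The `build_sitemap` and `build_url_list` helpers and the example.com URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the sitemaps.org 0.9 protocol (assumed format for this sketch).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return an XML sitemap string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")

def build_url_list(urls):
    """Return a plain text list, one URL per line."""
    return "\n".join(urls) + "\n"

pages = [
    "http://www.example.com/",
    "http://www.example.com/products/blue-widgets/",
]
sitemap_xml = build_sitemap(pages)
url_list = build_url_list(pages)
print(sitemap_xml)
print(url_list)
```

Either file is just a hint to the search engine about which pages exist; it complements normal crawling rather than replacing it.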
There’s actually quite a bit more involved with making your site crawlable, but I’ll leave it at this for now.
Resources on crawler friendly web sites:
- Google Webmaster Guidelines
- Yahoo! Search Content Quality Guidelines
- MSN guidelines
- Ask/Teoma Crawler Information
- Smart IT Consulting Weblog
- Speed Bumps and Walls – SEOMoz
- “Meet the Crawlers” SES NYC 2006
Lee Odden is President and Founder of TopRank Online Marketing, specializing in organic SEO, blog marketing and online public relations. He’s been cited as a search marketing expert by publications including U.S. News & World Report and The Economist and has implemented successful search marketing programs with top BtoB companies of all sizes. Odden shares his marketing expertise at Online Marketing Blog offering daily news, interviews and best practices.