Yahoo SiteExplorer Web vs. API

    June 12, 2007

In response to my post about the Yahoo API giving the “wrong” results, I got an email from a Yahoo! rep, and we’ve emailed back and forth a few times since. When I showed him the difference in the numbers given through the API and the Web interface for (I’ve updated my domain-info tool to both scrape the web interface and get the numbers through the API), saying “they can’t both be accurate”, he explained the difference this way:

Nope, not going to claim they are accurate, merely an estimate taken from either the raw (for the scraped pages) or semi-analyzed (for the API data) for the server cluster you hit at the time of request.

If it were possible to return accurate numbers, I’m willing to bet they’d do that. Unfortunately, it’s usually not, due to stuff like scaling issues, crawl vs. report lag and other factors.

Followed by another nice quote at the end of that email:

Again, I’m not claiming that these number are the best possible (even as estimates, that’s why the engineers are trying to improve them), but they do serve as a guide. Likewise, I’d definitely make sure to grab numbers from Google, Ask and MSN since decision making off of one data point seldom makes for good decisions.

Now I think this point would be great, were it not that the data these other three engines give is either stupidly off (in the case of Google and Ask) or non-existent at the moment (MSN).

He says something else too:

To be honest, the only person that can accurately measure real inbound link counts are the folks that control the access logs and can scan and report those. Anything outside of those numbers is never going to be as accurate.

Now this would be true if all scrapers gave me clickthroughs… but they don’t. So I think access logs can give me a nice sample of the links that truly have value, but they wouldn’t show me any DMOZ links, for instance. Another problem, of course, is that your competitor probably won’t give you access to his access logs… So we need interfaces like these. Given these answers, the two different numbers each have their own value, so for now I’m going to keep both using the API and scraping the web interface.
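To make the access-log idea a bit more concrete, here is a rough sketch of how you could pull referring hosts out of your own logs. It assumes the server writes the Apache “combined” log format, and the function name `inbound_link_hosts` and the sample hosts are made up for the example; it is not how my domain-info tool works.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumes the Apache "combined" log format, where each line ends with:
#   "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "[^"]*"$'
)


def inbound_link_hosts(lines, own_host):
    """Count hits per external referring host (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line.strip())
        if not match:
            continue  # not a combined-format line, skip it
        host = (urlsplit(match.group("referer")).hostname or "").lower()
        # Skip empty referers ("-") and clicks coming from our own site.
        if not host or host == own_host or host.endswith("." + own_host):
            continue
        counts[host] += 1
    return counts
```

Feeding it the lines of a log file, for example `inbound_link_hosts(open("access.log"), "example.com")`, gives a count per referring host. As said above, this only sees links that somebody actually clicked, so a link that never sends a visitor stays invisible.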

I must say, though, that it’s awesome to be able to email with a rep from Yahoo! about this and discuss it so openly, and that they have no problem at all with me blogging about it.