Newspapers Propose New Indexing Standards
This should come as no surprise from the people who’d sue you to stop you from linking to them: at a publishers’ consortium today, after complaining about the limited nature of robots.txt, the newspaper industry has proposed new standards to prevent search engines (and other sites) from indexing their sites willy-nilly.
This has been in the works since September of last year, when Andy called it like he saw it: “Publishers to Spend Half Million Dollars on a Robots.txt File.” Granted, it’s not just a robots.txt file—but it’s not too far off.
The plan, announced at an NYC gathering of a publishers’ consortium today and known as Automated Content Access Protocol (ACAP), would give publishers more say in what search engines could do. Rather than a simple do/do not index request, publishers could come up with specific rules, such as how long a search engine could retain content, or what links it can follow.
Let’s just make sure we’re all on the same Internet. The three complaints I see in here are already covered by existing methods:
- Index or not: covered by robots.txt, as you mentioned.
- How long to retain content: the unavailable_after META directive, announced in July and live as of August. (And if you want search engines to wait before indexing something, either don't publish it yet or block it with robots.txt or a robots meta tag.)
- What links it can follow: rel="nofollow"
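To make that list concrete, here is a rough Python sketch of how a well-behaved crawler might honor all three existing controls. Everything site-specific is made up for illustration: the robots.txt rules, the article markup, the example.com URLs, the "ExampleBot" user agent and the `PageRules` helper are all hypothetical, and the date in the `unavailable_after` value is just a sample. It uses only the standard library and is not how any real search engine is implemented.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary newspaper site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /subscribers/
"""

# Hypothetical article page using the two page-level controls.
ARTICLE_HTML = """\
<html><head>
<meta name="googlebot" content="unavailable_after: 25-Aug-2007 15:00:00 EST">
</head><body>
<a href="/news/related.html">Related story</a>
<a href="/ads/partner.html" rel="nofollow">Sponsor</a>
</body></html>
"""


class PageRules(HTMLParser):
    """Collects robots-style meta directives and links without rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.directives = []   # raw content of robots/googlebot meta tags
        self.followable = []   # hrefs a crawler would be allowed to follow

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if tag == "meta" and name in ("robots", "googlebot"):
            self.directives.append(attrs.get("content") or "")
        elif tag == "a" and attrs.get("href"):
            rel = (attrs.get("rel") or "").lower().split()
            if "nofollow" not in rel:
                self.followable.append(attrs["href"])


# 1. Index or not: robots.txt
robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

# 2 and 3. Retention and link following: read them off the page itself
page = PageRules()
page.feed(ARTICLE_HTML)

print(robots.can_fetch("ExampleBot", "http://example.com/subscribers/archive.html"))  # False
print(robots.can_fetch("ExampleBot", "http://example.com/news/today.html"))           # True
print(page.directives)   # ['unavailable_after: 25-Aug-2007 15:00:00 EST']
print(page.followable)   # ['/news/related.html']
```

None of this needs a new protocol. The publisher writes a couple of lines of robots.txt and markup, and the crawler reads them.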
Do you guys always have to do things your way/the hard way? Or are you just too lazy to put in meta elements and rel="nofollow"s?
I’m pretty disappointed that publishers spent all this time and money on trying to impose a standard on search engines that have worked pretty hard to give us the same capabilities already. The fact that ACAP may keep other sites from scraping newspaper sites’ content (yeah, right; they break copyright laws, why would they bother with the standard that you’re imposing on them?) is little consolation. After all, those publishers could have given that money to me.
Best of all, nobody but Exalead has signed on yet. Google's evaluating the ACAP proposal right now, but I think it's more than a little presumptuous of the publishing industry to publish its own standard, when Google et al. already have a fairly similar set of standards, and expect everyone to bend over backwards to meet its demands.
For a great analysis of ACAP, ask Danny Sullivan. He says ACAP won't affect the broader market for a while:
So now we have a new standard for expressing search engine permissions. Do site owners need to run out and immediately use it?
No. Not immediately. Not even long term.
Right now, none of the major search engines are supporting ACAP. If you were to use ACAP without ensuring that standard robots.txt or meta robots commands were also included, you’d fail to properly block search engines. Only Exalead, which is not a major multi-country service, would currently act upon your ACAP-only commands.
[But] I think it’s been very useful that some group has diligently and carefully tried to explore the issues, and having ACAP lurking at the very least gives the search engines themselves a kick in the butt to work on better standards. Plus, ACAP provides some groundwork they may want to use. Personally, I doubt ACAP will become Robots.txt 2.0 — but I suspect elements of ACAP will flow into that new version or a successor.
Danny’s a lot more fair, even-handed and nice than I am.
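Danny's warning about ACAP-only rules is easy to demonstrate. The sketch below treats Python's standard robots.txt parser as a stand-in for a crawler that doesn't support ACAP. The `ACAP-crawler` and `ACAP-disallow-crawl` field names are my assumption about what ACAP-style lines look like and haven't been checked against the spec; the URL, the "ExampleBot" user agent and the `allowed` helper are likewise hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed ACAP-style lines; a parser that only knows standard robots.txt
# fields will skip any field name it does not recognize.
ACAP_ONLY = """\
ACAP-crawler: *
ACAP-disallow-crawl: /archive/
"""

# The same file with the standard equivalent appended.
ACAP_PLUS_STANDARD = ACAP_ONLY + """\
User-agent: *
Disallow: /archive/
"""


def allowed(robots_txt, url):
    """Return True if a standard-robots.txt-only crawler may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("ExampleBot", url)


URL = "http://example.com/archive/story.html"
print(allowed(ACAP_ONLY, URL))           # True: the ACAP lines are ignored
print(allowed(ACAP_PLUS_STANDARD, URL))  # False: the standard rule blocks it
```

So a publisher who adopts ACAP still has to keep the plain old robots.txt rules around for every crawler that hasn't signed on.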