Publishers Push ACAP As Robots.txt Improvement
The Automated Content Access Protocol (ACAP) debuted today, aiming to remedy deficiencies in the robots.txt protocol currently observed by search crawlers.
Google’s courtroom bête noire Agence France-Presse and business publishers Reed Elsevier and John Wiley & Sons are among those who developed ACAP. The protocol grew out of publishers’ need to control their content while still making it available to search engine users; it is designed to complement the Robots Exclusion Protocol found in robots.txt files.
“To date, many aggregation websites have chosen to adopt a liberal attitude to copyright – ‘it’s OK until someone tells us it isn’t’ – which means there is an enormous amount of infringing material being hosted by major companies,” said the group.
That “liberal attitude” has been a regular complaint leveled at Google, not just over online content but also over its book search project. Google’s book scanning has angered publishers because of the search giant’s position that publishers must actively opt out, rather than Google proactively seeking permission to index.
Publishers consider robots.txt too simplistic, offering only allow/disallow choices for spiders over directories or types of content. “These simple choices are inconsistently interpreted,” ACAP claimed; search engines will likely find that opinion a surprising one.
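The “inconsistently interpreted” complaint is easy to reproduce. Python’s standard-library parser applies the first matching rule, while Google, for instance, documents most-specific (longest) match precedence, so the same file can produce opposite answers for the same URL. The paths and domain below are illustrative:

```python
from urllib.robotparser import RobotFileParser

# A robots.txt where a broad Disallow precedes a more specific Allow.
rules = """\
User-agent: *
Disallow: /archive/
Allow: /archive/free/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Python's parser stops at the first matching rule, so the
# broad Disallow wins and this URL is blocked:
print(rp.can_fetch("*", "http://example.com/archive/free/a.html"))  # False

# A longest-match crawler would allow that same URL, because
# "Allow: /archive/free/" is the more specific rule.

# Unmatched paths default to allowed under either interpretation:
print(rp.can_fetch("*", "http://example.com/news/today.html"))  # True
```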
ACAP extends, in a standard form, what robots.txt presents to crawlers. For example, a time-limit value can be defined, telling the crawler that the publisher wants certain content to expire and be removed from the index after a given date or period of time.
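The idea might look something like the sketch below, placed alongside existing robots.txt rules. The directive and parameter names here are hypothetical illustrations of the concept, not quotations from the ACAP specification:

```
# Hypothetical sketch of ACAP-style directives (names illustrative,
# not drawn from the published ACAP specification).
ACAP-crawler: *
ACAP-allow-crawl: /news/
# The expiry idea described above: the publisher asks that
# indexed copies be removed 30 days after crawling.
ACAP-allow-index: /news/ time-limit=30d
```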
From a cursory review of the technical documentation, ACAP looks like a way for publishers to establish usage guidelines they could later use in lawsuits against search engines. Though publishing groups have backed the standard, major search engines like Google and Yahoo are absent from ACAP’s list of supporters.
This could be a step toward a fight to force the search engines to comply with ACAP as they crawl available online content. We won’t be surprised if the search engines respond by refusing to spider ACAP-enhanced sites until all parties publicly agree to accept ACAP as an extension of the robots.txt standard.