Google Obtains Similarity Engine Patent

    January 4, 2007

This week, Google was awarded a patent for technology designed to address duplicate content issues throughout the index. The patent, originally filed in December of 2001, is entitled “Methods and Apparatus for Estimating Similarity.”

Does Google’s New Patent Doom Dupe Content?

Duplicate content continues to be a thorn in the side of search users the world over. I can't tell you how many times I come across multiple copies of the same article when I'm sifting through Google's blog search. Scraper sites are as rampant as ever, and the duplicate-content dilemma only seems to be getting worse.

A recently approved patent, however, brings to light steps that Google is taking to deal with the duplicate content issue. The patent abstract reads as follows:

A similarity engine generates compact representations of objects called sketches. Sketches of different objects can be compared to determine the similarity between the two objects. The sketch for an object may be generated by creating a vector corresponding to the object, where each coordinate of the vector is associated with a corresponding weight.

The weight associated with each coordinate in the vector is multiplied by a predetermined hashing vector to generate a product vector, and the product vectors are summed. The similarity engine may then generate a compact representation of the object based on the summed product vector.
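The patent itself publishes no code, but the technique the abstract describes — a weighted feature vector, pseudo-random "hashing vectors" of +1/-1 components, a summed product vector collapsed into a compact sketch — can be illustrated with a short Python sketch. Everything here (the function names, the use of MD5 to derive the hashing vectors, the 64-bit fingerprint size) is an assumption for illustration, not Google's actual implementation:

```python
import hashlib

def simhash(features, bits=64):
    # Sketch of the technique the abstract describes. `features` maps each
    # coordinate (e.g. a word or shingle from a document) to its weight.
    totals = [0.0] * bits
    for feature, weight in features.items():
        # Derive a deterministic pseudo-random hashing vector of +1/-1
        # components from the feature itself (MD5 here is an illustrative
        # choice, not something the patent specifies).
        h = int.from_bytes(hashlib.md5(feature.encode("utf-8")).digest(), "big")
        for i in range(bits):
            # Multiply the +1/-1 component by the feature's weight and
            # accumulate into the summed product vector.
            totals[i] += weight if (h >> i) & 1 else -weight
    # Compact representation: keep only the sign of each summed coordinate.
    fingerprint = 0
    for i in range(bits):
        if totals[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    # Similar documents yield fingerprints that differ in few bit positions.
    return bin(a ^ b).count("1")
```

Two documents with mostly overlapping weighted features will tend to produce fingerprints with a small Hamming distance, which is what lets an index compare compact sketches instead of full documents when hunting for near-duplicates.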

I’m pretty savvy when it comes to techno-babble. I can explain what a Heisenberg Compensator does in a Star Trek transporter relay, but all this talk of vectors and objects eludes me.

Luckily, there are people around like Bruce Clay, who can sift through the geek speak and get to the heart of the matter. This is Bruce’s summary of the patented technology’s impact:

Doing this means Google will be able to streamline its indexing process and help to reduce the amount of duplicate content on the Web. It also means if you’re not careful with your breadcrumb navigation, using dynamic URLs or implementing any of the other techniques commonly associated with duplicate content issues, you may find all your search engine optimization dollars officially wasted when Google decides not to index your site.

Wow, it looks like SEOs had better take careful note of what this patent could mean for the indexing process.

As for me, I will just be happy if the sites that scrape WebProNews articles are unceremoniously booted from the index in similar fashion to the way that LSU dispatched Notre Dame in the Sugar Bowl.


Joe is a staff writer for WebProNews. Visit WebProNews for the latest ebusiness news.