
How Much Of Google’s Webspam Efforts Come From These Patents?

Old patents may play role in new algorithm update


Bill Slawski over at SEO By The Sea, who is always up on search industry patents, has an interesting article talking about a patent that might be related to Google’s new Webspam Update.

It’s called “Methods and systems for identifying manipulated articles.” The abstract for the patent says:

Systems and methods that identify manipulated articles are described. In one embodiment, a search engine implements a method comprising determining at least one cluster comprising a plurality of articles, analyzing signals to determine an overall signal for the cluster, and determining if the articles are manipulated articles based at least in part on the overall signal.

The patent was filed all the way back in 2003 and was awarded in 2007. Of course, the new update is really based on principles Google has held for years. The update is designed to target violators of its quality guidelines.

Patent jargon makes my head hurt, and I’m willing to bet there’s a strong possibility you don’t want to sift through this whole thing. Slawski is a master at explaining these things, so I’ll just quote him from his piece.

“There are a couple of different elements to this patent,” he writes. “One is that a search engine might identify a cluster of pages that might be related to each other in some way, like being on the same host, or interlinked by doorway pages and articles targeted by those pages. Once such a cluster is identified, documents within the cluster might be examined for individual signals, such as whether or not the text within them appears to have been generated by a computer, or if meta tags are stuffed with repeated keywords, if there is hidden text on pages, or if those pages might contain a lot of unrelated links.”
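To make that a little more concrete, here is a rough Python sketch of the general idea: group pages into clusters, score each page on a few signals, and combine the scores into one overall signal per cluster. This is only an illustration, not Google’s implementation. The `Page` class, the two signals, the weights, the choice to cluster by host, and the 0.4 threshold are all assumptions made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List
from urllib.parse import urlparse


@dataclass
class Page:
    """Hypothetical record for one crawled page."""
    url: str
    meta_keywords: List[str]
    has_hidden_text: bool


def page_signal(page: Page) -> float:
    """Score a page from 0.0 (clean) to 1.0 (suspicious). Weights are invented."""
    score = 0.0
    if page.meta_keywords:
        # Share of meta keywords that repeat an earlier keyword (keyword stuffing).
        repeats = len(page.meta_keywords) - len(set(page.meta_keywords))
        score += 0.5 * repeats / len(page.meta_keywords)
    if page.has_hidden_text:
        score += 0.5
    return score


def cluster_by_host(pages: List[Page]) -> Dict[str, List[Page]]:
    """Group pages by host name, one of the cluster types Slawski mentions."""
    clusters = defaultdict(list)
    for page in pages:
        clusters[urlparse(page.url).netloc].append(page)
    return dict(clusters)


def overall_signal(cluster: List[Page]) -> float:
    """Combine per-page signals into one cluster signal (a plain average here)."""
    return sum(page_signal(p) for p in cluster) / len(cluster)


def flag_manipulated(pages: List[Page], threshold: float = 0.4) -> Dict[str, bool]:
    """Mark each cluster as manipulated when its overall signal meets the threshold."""
    return {
        host: overall_signal(cluster) >= threshold
        for host, cluster in cluster_by_host(pages).items()
    }
```

The point of scoring the cluster rather than each page alone is that one odd page does not condemn a site, while a host full of stuffed or hidden-text pages stands out.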

He goes on to talk about many of the improvements Google has made to its infrastructure and spam-detecting technologies. He also notes that two phrase-based patents were granted to Google this week. One is for “Phrase extraction using subphrase scoring” and the other is for “Query phrasification”. The abstracts for those are (respectively):

An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are extracted from the document collection. Documents are the indexed according to their included phrases, using phrase posting lists. The phrase posting lists are stored in an cluster of index servers. The phrase posting lists can be tiered into groups, and sharded into partitions. Phrases in a query are identified based on possible phrasifications. A query schedule based on the phrases is created from the phrases, and then optimized to reduce query processing and communication costs. The execution of the query schedule is managed to further reduce or eliminate query processing operations at various ones of the index servers.

And…

An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are extracted from the document collection. Documents are the indexed according to their included phrases, using phrase posting lists. The phrase posting lists are stored in an cluster of index servers. The phrase posting lists can be tiered into groups, and sharded into partitions. Phrases in a query are identified based on possible phrasifications. A query schedule based on the phrases is created from the phrases, and then optimized to reduce query processing and communication costs. The execution of the query schedule is managed to further reduce or eliminate query processing operations at various ones of the index servers.
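For a feel of what “phrase posting lists” and “possible phrasifications” could mean in practice, here is a small Python sketch. It is a toy under stated assumptions, not the system the patents describe: the function names, the naive substring matching, and the sample data in the usage note are all invented for illustration, and tiering, sharding, and query scheduling are left out entirely.

```python
from collections import defaultdict
from typing import Dict, List


def build_posting_lists(docs: Dict[int, str], phrases: List[str]) -> Dict[str, List[int]]:
    """Map each phrase to the sorted IDs of the documents that contain it."""
    postings = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        for phrase in phrases:
            # Crude substring test; a real index would match on token boundaries.
            if phrase in text:
                postings[phrase].append(doc_id)
    return dict(postings)


def phrasifications(words: List[str]) -> List[List[str]]:
    """Return every way to split the query words into contiguous phrases."""
    if not words:
        return [[]]
    results = []
    for i in range(1, len(words) + 1):
        head = " ".join(words[:i])
        for rest in phrasifications(words[i:]):
            results.append([head] + rest)
    return results
```

Calling `phrasifications(["new", "york", "pizza"])` yields four candidate splits, including `["new york", "pizza"]`. A search engine could then prefer the splits whose phrases actually have posting lists.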

If you’re really interested in tech patents and the inner workings of search engines, I’d suggest reading Slawski’s post. I’d also suggest watching Matt Cutts explain how Google Search works.

  • http://www.pandacode.com Stefan

    I can confirm that pages that have been promoted using blog networks were hit hardest. It is these blog networks that use ‘spun’ articles. Google might call them ‘manipulated’ articles.
    Big authority domains are on the rise – even with crappy content…