Google Goes Topical: The Smoking Gun
Uncovering The New Algorithm
There have been all sorts of theories put forward about Google’s new algorithm. What I present here, as far as I can tell, fits the facts at hand. You are welcome to disagree with me.
About PageRank, And Why The Old Google Algorithm Doesn’t Work Any More
The idea of PageRank is that a “random walk” through the web will tell you which sites are the most important. It simulates what would happen if a random surfer followed random links from one page to the next, hitting the back button at dead ends. The higher the PageRank of a page, the more likely it is that the random surfer will come across it.
The way this all works is pretty ingenious, really. The more links to a page, the more likely a random surfer is to find it. Links from more popular pages count for more, because the random surfer is more likely to find one of those links.
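For readers who like to see the moving parts, here is a minimal sketch of the random-surfer calculation on a tiny made-up web. The function name, the toy link graph, the damping value of 0.85, and the way dead ends are handled (the surfer jumps to a random page) are all my assumptions for illustration. This is not Google's actual implementation.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank for a small link graph by repeated passes.

    links maps each page to the list of pages it links to.
    damping is the chance the surfer follows a link rather than jumping
    to a random page (0.85 is an assumed, commonly used value).
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline from random jumps.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dead end (assumption): the surfer jumps to a random page.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: everything links to "hub", nothing links to "orphan".
toy_web = {
    "hub": ["a", "b"],
    "a": ["hub"],
    "b": ["hub", "a"],
    "orphan": ["hub"],
}
scores = pagerank(toy_web)
```

In this toy web, "hub" ends up with the highest score because every other page links to it, and "orphan" ends up with the lowest because nothing links to it.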
PageRank works great for searching collections of research papers within a particular field. For example, if you’re searching a collection of papers (or web pages) about particle physics, the PageRank algorithm will quickly tell you which are the most important (and relevant) papers for a given search query, because those papers will be cited more frequently by other papers.
If the web were all about a single topic, this would work perfectly well. Unfortunately, the web encompasses millions of topics, and in the real world, searchers are not taking a random walk. Searchers are looking for information on a specific topic. The PageRank system rewards all links, regardless of the topic of the page carrying the link.
Google has tried to overcome this limitation by taking the text of the links into account, but savvy search engine marketers have learned to trick Google’s algorithm by planting keyword-laden links all over the web. A cottage industry has grown up around PageRank, and links from “high PageRank” pages can be bartered, bought and sold.
I’ve received offers as high as $3,500 per month to put (utterly irrelevant) text links on the home page of my Inside Out Marketing site, which shows a PageRank of 7 (out of 10) on the Google toolbar. While I am not entertaining these offers, others are actively and systematically pursuing such relationships.
When anyone can achieve high rankings for their pages by buying links from unrelated web sites, or trading links with unrelated sites, PageRank becomes nearly useless in finding quality results for many search queries. When the world’s leading search engine sees the quality of their search results deteriorating, they don’t sit still. We all need to understand that Google needed to do something different.
Meet Your New Best Friend: Topic-Sensitive PageRank
Taher H. Haveliwala, a Ph.D. student at Stanford University, published a very interesting paper in 2002 on “Topic-Sensitive PageRank.” You can read the paper online (http://www2002.org/CDROM/refereed/127/) or download the extended version as a PDF (http://www.stanford.edu/~taherh/papers/topic-sensitive-pagerank-tkde.pdf). Interestingly enough, Mr. Haveliwala went to work at Google in October 2003.
Topic-Sensitive PageRank addresses the problems with the basic PageRank system by adding a “bias” to the random searcher’s random walk. This new random searcher has a clear intent, and is more interested in following relevant links from relevant pages, related to a specific topic. This is a relatively new idea, but one that solves a couple of key problems in delivering quality search results.
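To make the “bias” concrete, here is a rough sketch in which the random surfer, whenever she gets bored, jumps only to pages known to be about the topic instead of to any page on the web. The function name, the toy link graph, the choice of seed pages, and the 0.85 damping value are assumptions of mine for illustration; the paper's own setup uses Open Directory categories as the topic pages.

```python
def biased_pagerank(links, topic_pages, damping=0.85, iterations=50):
    """PageRank variant where random jumps land only on topic_pages.

    links maps each page to the pages it links to; topic_pages is a set
    of pages (assumed to be in the graph) known to be about the topic.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    # The bias: jump probability is spread over topic pages only.
    bias = {p: (1.0 / len(topic_pages) if p in topic_pages else 0.0) for p in pages}
    rank = dict(bias)
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) * bias[p] for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dead end (assumption): jump back to the topic pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] * bias[p]
        rank = new_rank
    return rank


# Hypothetical web: a cycling cluster, plus casino pages that link into it.
toy_web = {
    "bike_shop": ["bike_trails"],
    "bike_trails": ["bike_shop", "bike_club"],
    "bike_club": ["bike_trails"],
    "casino": ["casino_links"],
    "casino_links": ["casino", "bike_shop"],
}
cycling = biased_pagerank(toy_web, {"bike_trails", "bike_club"})
```

In this toy example the casino pages score zero for the cycling topic, because no cycling page links to them and the surfer never jumps to them. The bike shop still earns a score from the link it gets from inside the cluster.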
Mr. Haveliwala is clearly going to be an “impact player” in the search engine world. He has done substantial work in other areas of search technology, including some very interesting studies of how to compute PageRank more efficiently. You can see his published works online from his home page (http://www.stanford.edu/~taherh/papers).
In the original research paper, Haveliwala describes how he used the Stanford WebBase Repository to compute “Topic Sensitive” PageRank scores for 16 topics matching the top-level categories of the Open Directory. Even with a limited set of data (80 million pages) and a limited number of topics, this new method could be seen to improve search results, given an understanding of what topic the searcher was interested in.
In reviewing this paper last year, I noted two problems with applying it to a search engine. As we shall see, both of these problems can now be overcome.
The first problem is expanding the number of topics sufficiently. 16 topics is clearly not enough to produce a major improvement in search results, but the computation of PageRank is very costly, and unless some improvement could be found, it’s unlikely Google could implement this system. There have been significant developments in this area in the past year, and I no longer believe that this is a significant obstacle.
The second problem is determining what the “topic” of a search might be – when the searcher uses the word “bicycle” in a search query, does she want to buy one, or ride one? I will explain shortly how Google might be able to determine the appropriate topic match for a given search query, and demonstrate how this explains why some search queries are affected more than others.
About Applied Semantics & CIRCA
Google acquired a small company called Applied Semantics early in 2003. This company’s technology has already had a significant impact on Google. Among other things, Applied Semantics’ AdSense technology is used to deliver context-based advertising for pay-per-click advertisers on Google’s “AdWords” system. You can see AdSense at work on my content portal (www.insideoutmarketing.com) and many other web sites. The ads change based on the content of the page – pretty cool, right?
Well, AdSense isn’t the only technology that Google picked up with this acquisition. In fact, the underlying technology of AdSense is called CIRCA. I’ll take a short cut here, by quoting from a press release:
“Applied Semantics’ CIRCA Technology is based on a language-independent, highly scalable ontology that consists of millions of words, their meanings, and their conceptual relationships to other meanings in the human language. The ontology, aided by sophisticated search technology, is the basis for a conceptual understanding of the multiplicity of word meanings, enabling computers to more effectively manage and retrieve information which results in improved knowledge discovery opportunities for searchers.”
CIRCA allows Applied Semantics (and Google) to identify concepts related to specific words and phrases. They use this technology right now to serve up relevant advertising in a variety of contexts. Applied Semantics technology may also be involved in Google’s keyword stemming system.
Among other things, CIRCA can calculate how closely related or similar “phrase A” is to “concept B.” If you search for “Colorado bicycle trips,” CIRCA can relate that conceptually to a region (Colorado, which is in the Rocky Mountains), to concepts like bicycling and travel, etc. This is important, because it means that they can calculate the “distance” between your search query and various concepts in their database.
Putting It All Together – How To Implement A Topic-Sensitive Search Engine
So now that we know about Topic-Sensitive PageRank and CIRCA, how are they related? In other words, how could Google combine these technologies to produce a better search engine?
Let’s imagine, first of all, that Google has solved the problem of how to calculate Topic-Sensitive PageRank for a large number of topics (or concepts) – perhaps hundreds, maybe thousands. With the old PageRank system, it was important to calculate a very accurate value, but as we shall see, a good fast approximation may be all they need with a topic-sensitive algorithm. Read through some of the published papers and you will see that this is already possible.
Now, take a typical search query like “Colorado bicycle trips.” Those words are going to closely match at least a few topics within the CIRCA database. Based on the “distance” between the search terms used, and the topics in the database, Google could then apply a “topic-sensitive PageRank” score to deliver better search results. The more closely related the search is to a topic, the greater the impact of the topic-sensitive PageRank score.
Because a given search query might match multiple topics, an approximation of the PageRank score could be sufficient to deliver quality results, because any small errors in the PageRank calculation would be averaged out over the various topic-sensitive PageRank scores affecting that query.
If there aren’t any matching topics, Google could still use the good old PageRank system. If there are too many matching topics, they could do the same, although applying multiple topic scores might look a lot like the old system anyway. If the matching topics were only distantly related to the search query, the impact would simply be lessened.
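The blending described above can be sketched in a few lines. Everything here is hypothetical: the function name, the 0.1 cutoff for a “matching” topic, the made-up scores, and especially the rule that the closest topic match decides how much weight the topical score gets over plain PageRank. It is one way the pieces could fit together, not a description of what Google actually does.

```python
def blended_score(page, topic_weights, topic_ranks, plain_rank, min_weight=0.1):
    """Blend topic-sensitive scores for one page.

    topic_weights: topic -> closeness of the query to that topic, 0 to 1
                   (standing in for a CIRCA-style "distance" measure).
    topic_ranks:   topic -> {page: topic-sensitive PageRank score}.
    plain_rank:    {page: ordinary PageRank score}, used as the fallback.
    """
    matched = {t: w for t, w in topic_weights.items()
               if w >= min_weight and t in topic_ranks}
    if not matched:
        # No matching topics: use the good old PageRank score.
        return plain_rank.get(page, 0.0)
    total = sum(matched.values())
    # Closer topics contribute more to the topical score.
    topical = sum(w / total * topic_ranks[t].get(page, 0.0)
                  for t, w in matched.items())
    # Assumption: distant matches lessen the impact of the topical score.
    strength = max(matched.values())
    return strength * topical + (1.0 - strength) * plain_rank.get(page, 0.0)


# Made-up scores for two pages and a query like "Colorado bicycle trips".
topic_ranks = {
    "cycling": {"trail_guide": 0.4, "link_farm": 0.0},
    "travel": {"trail_guide": 0.2, "link_farm": 0.1},
}
plain_rank = {"trail_guide": 0.1, "link_farm": 0.3}
weights = {"cycling": 0.6, "travel": 0.3, "finance": 0.0}

guide = blended_score("trail_guide", weights, topic_ranks, plain_rank)
farm = blended_score("link_farm", weights, topic_ranks, plain_rank)
fallback = blended_score("link_farm", {}, topic_ranks, plain_rank)
```

With these numbers the link farm wins on plain PageRank (0.3 against 0.1), but the trail guide wins once the topic scores are blended in (about 0.24 against 0.14). With no matching topics, the score simply falls back to plain PageRank.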
Understanding The Changes, Ignoring The Noise
For some search queries, the results have been radically changed – in a few cases, the top 100 listed pages have all dropped out. The folks at Google Watch have compiled a listing of affected search terms (see www.scroogle.org), and the amount of change in each, which has proven very valuable in conducting parts of our research.
One of the big problems with the available data is that there is a tendency for these radically changed results to be reported more often. Those folks who haven’t seen any change in their Google rankings aren’t complaining, so there’s a bit of a “squeaky wheel” effect at play here.
Most of the goofy conspiracy theories we’ve heard would be expected to show a lot of radically changed results, which is what you see in this “self-reported” data. The data looks this way, though, because most of it is coming from people who lost rankings.
Rather than looking at the “self-reported” changes in search results, we’ve taken a different approach, capturing the “most recent searches” from several available online sources, and looking at the change in those search results.
When we looked at hundreds of unbiased real world search queries and mapped out how much each had changed, we found a very clean distribution. In the real world, radical changes are the exception, not the rule.
Topics Are Not Keywords… And It’s Not Perfect
It’s important not to confuse “topics” with “keywords.” A topic would represent a general subject like “computing,” “marketing,” etc. Specific search terms, like “laptop rental” or “email marketing,” would be related to more general topics.
When you take a look at some search results Google is currently delivering, it’s clear that some of them have been matched up with the wrong topics. One example that’s come up pretty often in discussions I’ve had is “laptop rental.” You would think that folks searching for that would be interested in renting a laptop, but Google returns a list of laptop rental information from universities. Take a look: (http://www.google.com/search?sourceid=navclient&q=laptop+rental)
How could this happen? Looking at the links to those pages, you see a lot of similar topics like computing, housing (students rent housing in dormitories), etc. One savvy company has partnered with some of these universities to offer laptop rentals, and as a result they’re getting a bit of a free ride right now at Google.
Through links pages like this one: (http://computers-notebooks-laptops-lcd-projectorsrentals.com/rglinks.html), the rankings of many university laptop rental pages have been boosted, and as we’ve seen with many other search terms, once you dig into the underlying links, the search results for “laptop rentals” become very easy to understand.
Is it still possible for Google to deliver less than perfect results? Sure. Is it still possible for Google to be fooled? Of course it is. But it’s gotten more difficult, and we can expect Google to remedy many of these situations over time.
Why Some SERPs Have Changed Radically, While Others Have Barely Changed
When you weed out the noise, and look at the real data, it’s not hard to understand why some search terms have been affected more than others. When you dig in and look at similar searches, it gets even easier to see.
Looking at “real estate,” according to Scroogle.org’s methodology, 77 of the top 100 pages dropped out of the top 100. Looking at the more specific “colorado real estate,” 24 of the top 100 dropped out. You can see this pattern repeated over and over again. The more generic searches show more changes in the top results.
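If you want to run this kind of comparison on your own search terms, the count is easy to reproduce. This is only my guess at the general idea, with a hypothetical function name and made-up URLs; I don't know exactly how Scroogle computes its numbers.

```python
def dropped_out(before, after, depth=100):
    """Return pages from the old top `depth` that are missing from the new top `depth`.

    before and after are ranked lists of URLs, best result first.
    """
    new_top = set(after[:depth])
    return [url for url in before[:depth] if url not in new_top]


# Made-up rankings, compared at a depth of 5 to keep the example short.
before = ["a.com", "b.com", "c.com", "d.com", "e.com"]
after = ["a.com", "x.com", "c.com", "y.com", "b.com", "d.com"]
missing = dropped_out(before, after, depth=5)
```

Here "d.com" slipped to sixth place and "e.com" vanished, so two of the old top five dropped out. A page that merely moved within the top five, like "b.com", does not count.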
Look at the pages that dropped out of the “real estate” top 100. You will see a whole lot of local realtors who managed to link their way (using PageRank and link text) into enviable positions, but not too many are really among the 100 most relevant pages for that query.
The first page I see listed among the “missing” is titled “Southern California Real Estate.” Interestingly enough, that page shows up at #2 for the more specific search “Southern California Real Estate.” In other words, they haven’t been penalized, they just don’t show up where they don’t belong any more.
There are also a few highly competitive search terms where the rankings have changed very little. The existence of these search terms has been used to justify all sorts of theories, but there is a simple explanation for every example I have seen.
The most commonly cited example is “search engine optimization,” where there’s almost no difference in the top 30 pages. If you look at the top ranked pages, you will see that they are already well linked within the community of related sites, and could be expected to do well under a topic-sensitive PageRank system.
It’s also possible that some of these search terms have been used as a testing ground for the new algorithm for quite some time, in which case the radical changes would already have taken place. In the case of “search engine optimization,” there was a pretty significant shake-up earlier this year, which at the time was blamed on “spam penalties.” It now seems more likely that this was the result of testing by Google.
I Could Be Wrong, But It Doesn’t Matter Anyway
As I said, this involves a lot of speculation on my part. I’m probably wrong, at least in part. Maybe Google is doing something completely different. Maybe they’re doing some combination of very simple things. However, this fits the facts. Come up with a better explanation, and I’d love to hear it. So far, I haven’t heard a better theory.
It doesn’t really matter anyway. It’s clear enough that whatever Google is doing, the recipe for success is pretty simple. Those sites that have a lot of content and lots of relevant links (both incoming and outbound) have done well. Those that have gotten by with doorway pages and link swaps are no longer quite so successful.
Dan Thies is a well-known writer and teacher on search engine marketing. He offers consulting, training, and coaching for webmasters, business owners, SEO/SEM consultants, and other marketing professionals through his company, SEO Research Labs. His next online class will be a link building clinic beginning March 22