Putting Behavioural Metrics In Perspective

    January 15, 2009

So here’s the question: are behavioural metrics being used in modern search? You do remember them, right? Those warm and fuzzy little signals, such as bounce rates, that were all the rage in late 2008 in the search engine optimization world? Sure you do… but let’s take one last look.

Although bounce rates received the biggest attention, we would be remiss not to start by quickly listing some signals commonly looked at by information retrieval folks. They fall into two groups, implicit and explicit data (actions and interactions); examples include:

Implicit signals

  1. Query history (search history)
  2. SERP interaction (revisions, selections and bounce rates)
  3. User document behaviour (time on page/site, scrolling behaviour)
  4. Surfing habits (frequency and time of day)
  5. Interactions with advertising
  6. Demographics and geography
  7. Data from other applications (application focus – IM, email, reader)
  8. Closing a window

Explicit signals

  1. Adding to favourites
  2. Voting (a la Search Wiki or toolbar)
  3. Printing of page
  4. Emailing a page to a friend (from site)
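To make the two groups concrete, here is a minimal sketch of what one logged feedback event might look like, with a deliberately naive satisfaction heuristic. The field names, thresholds, and `looks_satisfied` function are my own illustrative assumptions, not any engine's actual schema.

```python
from dataclasses import dataclass

# Hypothetical event record tying the implicit and explicit signals above
# together. Field names are illustrative only.
@dataclass
class FeedbackEvent:
    query: str
    result_url: str
    rank: int                 # SERP position of the result
    clicked: bool             # implicit: selection
    dwell_seconds: float      # implicit: time on page
    bookmarked: bool = False  # explicit: added to favourites
    printed: bool = False     # explicit: printed the page

def looks_satisfied(ev: FeedbackEvent) -> bool:
    """Naive heuristic: a click with a long dwell, or any explicit action."""
    return (ev.clicked and ev.dwell_seconds >= 30) or ev.bookmarked or ev.printed

session = [
    FeedbackEvent("used cars", "example.com/a", 1, True, 4.0),   # quick bounce
    FeedbackEvent("used cars", "example.com/b", 2, True, 95.0),  # long read
    FeedbackEvent("used cars", "example.com/c", 3, False, 0.0),  # never clicked
]
satisfied = [ev.result_url for ev in session if looks_satisfied(ev)]
```

Keep that naive heuristic in mind; much of what follows is about the many ways a rule like this one misreads real behaviour.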

Now that we’re past that, let’s get a little geeky so those information retrievers don’t shake their heads too hard at us – the terminology. I am as guilty as the next Gypsy of flinging the term ‘behavioural metrics’ about over the last year or so, and even ‘performance metrics’. If you want to research this more, start with the term ‘implicit/explicit user feedback signals’ – because that’s what we’re talking about.

[Image: “This is not the ranking signal U were looking for” – thanks to Steve Gerencser for sending the pic]

Follow the bouncing timeline

While you can trace the timeline back in the search (blogging/reporting) world many years, it really came home when Search Engine Watch mentioned it (Oct. 2008), followed a few months later by a Search Engine Land post. Given the venerable status of those publications, grumblings around the SEO world soon followed. If you go and do some buzz monitoring and searching (which I have), much of the talk began after that. The cracks in the dam began to widen, and this Gypsy was left without enough chewing gum.

So what can we do? Where does one start to truly look for answers as to the potential of such methods being implemented by the top public-access search engines? It stands to reason that we begin with the information retrieval world itself. Over the last month I have given the community the benefit of the doubt and gone deeper to find some type of more definitive answer (list of research papers at the end).


Inherent problems with implicit signals

One thing that became obvious real fast is that the IR world is still not entirely sure of the value of implicit feedback signals when it comes to inferring engagement and satisfaction. While there is a long list of problematic areas, let’s consider:

  1. You save the link for later and continue your search (in a doc, let’s say)
  2. You found what you needed on the page and went looking for more information
  3. You walk away from your browser and leave the window on a page for an hour
  4. Multiple users in your home during a given session
  5. Opening a listing in a new window (when further tracking is unavailable)
  6. You found the information in a SERP snippet and selected nothing
  7. You were unsatisfied with the page selected and dug 3 pages deeper (unsatisfied, not engaged)
  8. Queries from automated tools (like a rank checker), which add noise to the overall data
  9. SERP bias – do peeps simply click the top x results regardless of relevance?
  10. Different users having different understandings of the relevance of a document (result)

…and on and on. Think about it: some situations can tell the search engine you’re pleased with the results, and other times such signals mean nothing. You see, the essential motive is to attempt to assign an emotional evaluation of engagement with the search results. Unfortunately there are too many noisy elements, which makes this a very difficult task to do effectively.

Noisy and confused

It’s widely felt that ‘implicit feedback is more difficult to interpret and potentially noisy’, as noted in Evaluating the Accuracy of Implicit Feedback from Clicks and Query Reformulations in Web Search (partially funded via a grant from Google). In looking at click behaviour, the authors found there was indeed a clicking bias based on a few elements;

“….First, we show that there is a “trust bias” which leads to more clicks on links ranked highly by Google, even if those abstracts are less relevant than other abstracts the user viewed.

Second, there is a “quality-of-context bias”: the users’ clicking decision is not only influenced by the relevance of the clicked link, but also by the overall quality of the other abstracts in the ranking.”

Other research (on click data) looked at how users actually interact with search results as far as bias is concerned. People are often consistent in their clicking patterns (clicking the top result, then the second, then the third) regardless of the underlying data. This means the entire data set can be skewed, as not clicking on the 8th result may not necessarily be a vote against the link in that result, but more of an ingrained habit on the part of the searcher.

They summarized;

“Our results show that click behaviour does not vary systematically with the quality of search results. However, click behaviour does vary significantly between individual users, and between search topics. This suggests that using direct click behaviour—click rank and click frequency—to infer the quality of the underlying search system is problematic.”

And also;

“Analysis of our user click data further showed that the action of clicking is not strongly correlated with relevance —only 52% of clicks in a search result list led to a document that the user actually found to be relevant. Attempts to use clicks as an implicit indication of relevance should therefore be treated with caution.” From – Using Clicks as Implicit Judgments: Expectations Versus Observations
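The “trust bias” described in those quotes is often modelled in the IR literature with the examination hypothesis: the probability of a click is the probability the rank is even looked at, times the probability the result is relevant. Here is a minimal sketch of that idea; all of the probabilities and click-through rates below are made-up numbers for illustration, not figures from any of the papers.

```python
# Assumed per-rank examination probabilities (illustrative values only):
# how often a user even looks at the abstract at each position.
EXAMINATION_PROB = {1: 0.95, 2: 0.80, 3: 0.60, 4: 0.45, 5: 0.30}

def debiased_relevance(rank: int, observed_ctr: float) -> float:
    """Discount raw click-through rate by how often the rank gets examined."""
    return observed_ctr / EXAMINATION_PROB[rank]

# Rank 1 draws more raw clicks, but after discounting examination
# probability the rank-3 result comes out as the more relevant one.
raw_ctr = {1: 0.40, 3: 0.35}
adjusted = {rank: debiased_relevance(rank, ctr) for rank, ctr in raw_ctr.items()}
```

The point of the sketch is simply that raw clicks and debiased relevance can disagree, which is exactly why the researchers above caution against reading clicks as votes.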

Beyond that, many of the papers flagged various elements of implicit user feedback that they felt warranted more study. In short, there is no consensus in the IR community about the validity of these signals – they’re not ready for prime time.


The Spam connection

And this, my friends, as they say, is the proverbial fly in the ointment. While there is a ton of research and even patents on behavioural metrics, dealing with click-spam has not been addressed in any detail to this point. Many of the papers openly admit they are light in the spam-detection area and that more research is needed.

“A natural question that arises in this setting is the tolerance of this method to noise in the training data, particularly should users click in malicious ways. While we used noisy real-world data, we plan to explicitly study the effect of noise, words with two meanings, and click-spam on our approach.” From – Query Chains: Learning to Rank from Implicit Feedback

And that’s just one; it was a common theme among the papers on the topic. This, for me, goes a long way toward showing that it is premature to suggest the search engines we optimize for are using such signals. There is hope, though, as some tests, such as those run by Microsoft, concluded;

“ranking accuracy decreases indeed when more documents are spammed, but the decrease is within a small range. When only a small number of documents are spammed per query, ranking accuracy is only slightly affected even if a large number of queries are spammed.” From – Are click-through data adequate for learning web search rankings?

They felt that such a large percentage of queries are long-tail queries that it would be more difficult to effectively disrupt the majority of query spaces (I hear Ralph rumbling somewhere with that one). But once more, there seems to be a lot more work to be done in this area to effectively combat spam in such a system. To this we add thoughts from a Cornell paper;

“… it might also be possible to explore mechanisms that make the algorithm robust against “spamming”. It is currently not clear in how far a single user could maliciously influence the ranking function by repeatedly clicking on particular links.” From – Optimizing Search Engines using Click through Data – Cornell (pdf)

For me, there simply isn’t enough research or hard data to suggest that the spam issues related to implicit user feedback and click data have been solved. This is a crucial element in the case for them being used today by Google or anyone else.
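To give a feel for why this is hard, here is one deliberately naive click-spam filter of the kind the papers say needs more study: flag any query/result pair where a single client supplies most of the clicks. The thresholds, the `(client_id, query, url)` tuple shape, and the function itself are my own illustrative assumptions; real systems would need far more than this.

```python
from collections import Counter

def suspect_pairs(clicks, max_share=0.5, min_clicks=5):
    """clicks: iterable of (client_id, query, url) tuples.

    Return the (query, url) pairs where one client contributes more than
    max_share of at least min_clicks total clicks.
    """
    clicks = list(clicks)
    per_pair = Counter((q, u) for _, q, u in clicks)       # clicks per result
    per_client_pair = Counter(clicks)                      # clicks per client
    suspects = set()
    for (client, q, u), n in per_client_pair.items():
        total = per_pair[(q, u)]
        if total >= min_clicks and n / total > max_share:
            suspects.add((q, u))
    return suspects

# One "bot" hammering a single listing dominates that pair's click count,
# while the low-volume organic pair goes unflagged.
log = [("bot", "widgets", "spam.example")] * 8
log += [("u1", "widgets", "spam.example"), ("u1", "widgets", "good.example"),
        ("u2", "widgets", "good.example")]
flagged = suspect_pairs(log)
```

Of course, a spammer who rotates clients (the surfbot-net scenario below) sails straight past a filter like this, which is rather the point.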

Not enough? Then also try this recent post by your friend and mine, CJ, on clickstream spam detection, or Fantomaster’s Behavioral Metrics and the Birth of SEO Surfbot Nets. Let us move along now, shall we?


Getting beyond the geeky; looking to the future

Are we getting somewhere yet? Great… but it’s not all doom and gloom; no need to call the coroner just yet. You see, for the most part researchers have been finding some great improvements in search performance; they simply haven’t worked out all the values of such signals, nor the spam concerns. In an enterprise environment, where manipulation/spam is far less likely, implicit feedback can be a more useful tool. It is the larger public-access environment, where spam is far more prevalent, in which the nut has yet to be cracked.

I stand by my original assertion that this type of approach is best served in a personalized environment. That would be huge in dealing with the apparent spam-related issues, as it is kinda’ hard to spam one’s self, you see. This makes personalization a likely candidate for user feedback signals. Either way, it simply hasn’t been solved yet.

So what are we left with?? Some noisy signals that are spammable… hmmm… where have we heard that before?

[Video: Matt Cutts on bounce rates]

And so now I leave all of this in your capable hands, my weary web warriors. If you can go through the research papers listed below (or elsewhere) and find me strong evidence of how they deal with noise reduction and click-spam, then we can discuss it further. That is my challenge to you, because from what is out there, it is not yet viable in a large-scale environment.

I submit to you, my enthusiastic optimizers, that bounce rates and their implicit feedback brethren are simply not likely to be in Google’s (nor any major search engine’s) current ranking schemes. They are a novelty at best, with potential in a personalized environment.

Care to dispute this? I am more than happy to review any research to the contrary.

Want to know what I think is actually causing what we believe we’re seeing? You’re just going to have to wait until next week.