We Can’t Judge Relevance

    November 20, 2007

Yesterday, I argued that Microsoft’s search engine update, which included highly touted relevance improvements, did not and will not ultimately improve their fortunes.

To state things a bit more strongly: I think that relevance cannot be a selling point for a search engine, and not only because Live’s update merely caught them up to the level of many other popular search engines. In fact, I think that it’s hard for any of us to truly evaluate “relevance” in results.

There are a number of reasons that I think this is so. First and probably foremost, search engines try (as much as any computer can “try”) to understand user intent, but they aren’t all that great at it. While they’ve been preprogrammed to return a specific type of results page for queries they recognize (like music pages at Ask, Yahoo and Google), they can’t automatically parse and understand what you’re looking for.

For example, a search for [beach park virginia] returned nothing I saw as relevant. While I knew the intent behind that phrase, Google could only find pages that had all of the same words there. Whose fault is that?

Sometimes search engines have a bit of help in picking up on user intent. That’s the premise behind personalization: gather enough data about a person’s search habits and you’ll be able to understand what it is they want when they type in [oneida].

But even personalized systems aren’t perfect. Yesterday, perhaps I was looking for flatware; today, I might be researching New York Indian tribes. Tomorrow, religious collective movements. This is part of the reason why there is a limit to how personalized Google has made its personalized results.

Like personalized results, our perception of relevance is subjective. As Philipp Lenssen said in a comment on a Google Operating System post last month:

A 51% “success rate” could mean Google is “only” the best 51% of the time, or it could mean Google returns exactly the right result for the tastes of 51% of the people, and that the other people prefer other types of results.

As we know, the brand associated with a SERP (and the image of the brand) has a great effect on how relevant we think the results are, even when the results are in fact exactly the same.

Add to this the fact that the snippets on the page aren’t long enough or detailed enough for us to really tell what we’re clicking through to. A site could be totally on-point for my query, but if it requires me to register, forces music upon me, features a horrific amount of ads or is simply completely illegible, I won’t be able to consider it “relevant.” (And I will run far, far away.)

Once we get to a site, design and other “irrelevant” factors affect our perception of it, making it difficult to isolate “relevance” alone as a cause for someone to hit the “back” button. And who’s to say that hitting the “back” button means a site is irrelevant, anyway? How many times have you gotten the information you needed and been done with a site?

Objective measures of relevance, on the other hand, are made in a vacuum. They are far outside the real world and our realm of experience. In an objective measure of relevance, the tester types in a query, which they probably didn’t choose. [Apple], perhaps.

And then the tester judges how relevant the results are. But “relevance” here isn’t determined by what the searcher really wanted when they typed in the query: it’s what the research team decided was the “right” answer when you ask a search engine “[Apple]?” If their definition doesn’t include Braeburns, suddenly the search engine is wrong.

One of my college professors called this problem with research “The Utterly Boring World.” In this world, The man bit the sandwich is a perfectly fine construct, while *The sandwich bit the man is ungrammatical because it is nonsensical. But there is a place for nonsense in the real world—and a place for Braeburns on a SERP, even if that wasn’t what you were looking for. It might be exactly what someone else wanted.