Google Gives More Details On Human Raters
Google has people that it pays to rate the quality of search results. They’re called raters. Google mentioned them last year in a widely publicized interview with Wired – the interview, in fact, in which the Panda update’s name was revealed.
Given that the Panda update was all about quality, many webmasters became very interested in these raters and their role in the ranking process.
Google talked about them a little at PubCon in November, and in December, Google’s Matt Cutts talked about them some more, saying, “Even if multiple search quality raters mark something as spam or non-relevant, that doesn’t affect a site’s rankings or throw up a flag in the url that would affect that url.”
Cutts posted a new video about the raters today, giving some more details about how the process works.
Very in-depth video today about how Google uses human eval rater data in search: http://t.co/9Nhn44TP Please RT!
“Raters are really not used to influence Google’s rankings directly,” says Cutts in the video. “Suppose an engineer has a new idea. They’re thinking, oh, I can score these names differently if I reverse their order because in Hungarian and Japanese that’s the sort of thing where that can improve search quality. What you would do is we have rated a large quantity of urls, and we’ve said this is really good. This is bad. This url is spam. So there are 100s of raters who are paid to, given a url, say is this good stuff? Is this bad stuff? Is it spam? How useful is it? Those sorts of things.”
“Is it really, really just essential, all those kinds of things,” he continues. “So once you’ve gotten all those ratings, your engineer has an idea. He says ‘OK, I’m going to change the algorithm.’ He changes the algorithm and does a test on his machine or here at the internal corporate network, and then you can run a whole bunch of different queries. And you can say OK, what results change? And you take the results that change and you take the ratings for those results and then you say overall do the results that are returned tend to be better, right? They’re the sort of things that people rated a little bit higher rather than a little bit lower? And if so, then that’s a good sign, right? You’re on the right path.”
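The offline check Cutts describes can be sketched roughly as follows. This is only an illustration, not Google's actual tooling: the function names, the numeric rating scale, and the data layout (a dict of query to result URLs, and a dict of URL to rating) are all assumptions made for the example.

```python
from statistics import mean


def changed_urls(old, new):
    """Return (removed, added) URLs between two result lists for one query."""
    old_set, new_set = set(old), set(new)
    return sorted(old_set - new_set), sorted(new_set - old_set)


def rating_delta(old_results, new_results, ratings):
    """Mean rating of URLs the new algorithm added minus mean rating of
    URLs it removed, across all queries. Unrated URLs are skipped.

    The numeric scale (higher is better) is a hypothetical stand-in for
    whatever labels the raters actually assign.
    """
    added_scores = []
    removed_scores = []
    for query, old in old_results.items():
        removed, added = changed_urls(old, new_results[query])
        added_scores += [ratings[u] for u in added if u in ratings]
        removed_scores += [ratings[u] for u in removed if u in ratings]
    if not added_scores or not removed_scores:
        return None  # not enough rated changes to compare
    return mean(added_scores) - mean(removed_scores)
```

A positive delta would correspond to the "good sign" in the quote: the results that changed tend to be rated a little higher than the ones they replaced.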
“It doesn’t mean that it’s perfect, like, raters might miss some spam or raters might not notice some things, but in general you would hope that if an algorithm makes a new site come up, then that new site would tend to be higher rated than the previous site that came up,” he continues. “So imagine that everything looks good. It looks like it’s a pretty useful idea. Then the engineer, instead of just doing some internal testing, is ready to go through sort of a launch evaluation where they say how useful is this? And what they can do is they can generate what’s called a side by side. And the side by side is exactly what it sounds like. It’s a blind taste test. So over here on the left-hand side, you’d have one set of search results. And on the right-hand side you’d have a completely different set of search results.”
Google showed the raters in a video last year, which included a glimpse of the side-by-side.
“If you’re a rater, that is a human rater, you would be presented with a query and a set of search results,” Cutts continues. “And given the query, what you do is you say, ‘I prefer the left side,’ or ‘I prefer the right side.’ And ideally you give some comments like, ‘Oh, yes, number two here is spam,’ or ‘Number four here was really, really useful.’ Now, the human rater doesn’t know which side is which, which side is the old algorithm and which side is the new test algorithm. So it’s a truly blind taste test. And what you do is you take that back and you look at the stuff that tends to be rated as much better with the new algorithm or much worse with the new algorithm.”
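To make the "blind" part concrete, here is a minimal sketch of how a side-by-side could hide which algorithm produced which column. Everything in it (the function names, the dict layout, the coin-flip assignment) is a hypothetical illustration; the video does not describe how Google implements this.

```python
import random


def make_side_by_side(query, old_results, new_results, rng):
    """Randomly place the old and new result sets on the left or right.

    Returns what the rater sees plus a hidden key that maps each side
    back to the algorithm that produced it.
    """
    new_on_left = rng.random() < 0.5
    if new_on_left:
        left, right = new_results, old_results
    else:
        left, right = old_results, new_results
    shown = {"query": query, "left": left, "right": right}
    key = {
        "left": "new" if new_on_left else "old",
        "right": "old" if new_on_left else "new",
    }
    return shown, key


def tally(votes):
    """Count preferences per algorithm from (preferred_side, key) pairs."""
    counts = {"old": 0, "new": 0}
    for preferred_side, key in votes:
        counts[key[preferred_side]] += 1
    return counts
```

The rater only ever sees `shown`; the `key` stays with the engineers and is used afterwards to translate "left" or "right" preferences back into old versus new.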
“Because if it’s about the same then that doesn’t give you as much information,” he says. “So you look at the outliers. And you say, ‘OK, do you tend to lose navigational home pages? Or under this query set do things get much worse?’ And then you can look at the rater comments, and you can see could they tell that things were getting better? If things looked pretty good, then we can send it out for what’s known as sort of a live experiment. And that’s basically taking a small percentage of users, and when they come to Google you give them the new search results. And then you look and you say OK, do people tend to click on the new search results a little bit more often? Do they seem to like it better according to the different ways that we try to measure that? And if they do, then that’s also a good sign.”
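The live experiment step, sending a small percentage of users to the new results and comparing clicks, might look something like the sketch below. The hash-based bucketing, the 1% default, and the function names are assumptions chosen for illustration; Cutts does not say how users are selected or which click metrics Google uses.

```python
import hashlib


def in_experiment(user_id, percent=1.0):
    """Deterministically place roughly `percent` percent of users in the
    experiment by hashing a (hypothetical) user identifier into one of
    10,000 buckets."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10000
    return bucket < percent * 100


def click_through_rate(impressions, clicks):
    """Fraction of result impressions that received a click."""
    return clicks / impressions if impressions else 0.0


def new_results_clicked_more(control, experiment):
    """Compare click-through rates given (impressions, clicks) pairs for
    the control group and the experiment group."""
    return click_through_rate(*experiment) > click_through_rate(*control)
```

Hashing the identifier keeps a given user in the same group across visits. A real evaluation would also need a significance test before reading anything into a small difference in click rates.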
Cutts acknowledges that the raters can get things wrong, and that they don’t always recognize spam.