
Netflix Dataset Cracked, Subscribers Profiled

Netflix offered a million-dollar reward to anyone who could improve upon its recommendation engine by ten percent. Two researchers accomplished a lot more with the "anonymized" dataset. The Netflix Prize provided researchers with records comprising 100,480,507 movie ratings from 480,189 subscribers, submitted between December 1999 and December 2005. The company challenged people to beat Netflix at its own recommendations.

The physics arXiv blog noted Netflix claimed to have removed personal details from the dataset before making it available. However, Arvind Narayanan and Vitaly Shmatikov at the University of Texas at Austin figured out how to de-anonymize that data.

The research paper on how they did it demonstrated the inherent risk in publishing such micro-data, or information about specific individuals.

"Using the Internet Movie Database as the source of background knowledge, we successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information," the researchers said in the paper's abstract.

Through their algorithmic work, the researchers could tie records in the Netflix dataset to ratings posted publicly on the Internet Movie Database website:

We expect that for Netflix subscribers who use IMDb, there is a strong correlation between their private Netflix ratings and their public IMDb ratings. Note that our attack does not require that all movies rated by the subscriber in the Netflix system be also rated in IMDb, or vice versa. In many cases, even a handful of movies that are rated by the subscriber in both services would be sufficient to identify his or her record in the Netflix Prize dataset...

In short, subscribers who rated movies publicly on IMDb around the same time they rated those movies privately on Netflix gave the researchers enough overlap to single out one person's record in the dataset.
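To make the idea concrete, here is a minimal sketch of this kind of record linkage. It is a toy illustration, not the researchers' actual algorithm: the function names, the tolerance parameters (`rating_tol`, `day_tol`, `min_gap`), and the sample records are all assumptions invented for this example. It counts the movies on which a public profile and an "anonymized" record roughly agree in both rating and date, and accepts the top candidate only if it clearly beats the runner-up.

```python
from datetime import date

def match_score(public, candidate, rating_tol=1, day_tol=14):
    """Count movies whose rating and date roughly agree in both records."""
    score = 0
    for movie, (rating, when) in public.items():
        if movie not in candidate:
            continue  # no overlap on this movie, so it contributes nothing
        c_rating, c_when = candidate[movie]
        close_rating = abs(rating - c_rating) <= rating_tol
        close_date = abs((when - c_when).days) <= day_tol
        if close_rating and close_date:
            score += 1
    return score

def best_match(public, anonymized, min_gap=2):
    """Return the id of the best-scoring record, or None if it does not stand out."""
    scores = sorted(
        ((match_score(public, rec), rec_id) for rec_id, rec in anonymized.items()),
        reverse=True,
    )
    if not scores:
        return None
    top_score, top_id = scores[0]
    runner_up = scores[1][0] if len(scores) > 1 else 0
    # Only claim a match when the winner is clearly ahead of every other record.
    return top_id if top_score - runner_up >= min_gap else None

# Hypothetical data: movie -> (star rating, rating date).
public_profile = {
    "movie_a": (5, date(2005, 3, 1)),
    "movie_b": (1, date(2005, 3, 3)),
    "movie_c": (4, date(2005, 4, 10)),
}
anonymized_records = {
    "rec_1": {
        "movie_a": (5, date(2005, 3, 2)),
        "movie_b": (1, date(2005, 3, 3)),
        "movie_c": (4, date(2005, 4, 12)),
        "movie_d": (3, date(2005, 5, 1)),  # rated privately only
    },
    "rec_2": {
        "movie_a": (2, date(2004, 1, 1)),
        "movie_c": (4, date(2005, 4, 11)),
    },
    "rec_3": {"movie_e": (5, date(2005, 3, 1))},
}

print(best_match(public_profile, anonymized_records))
```

Once a record is matched this way, every other rating in it (such as `movie_d` above, which never appeared publicly) becomes attributable to the named person. That is the exposure the researchers describe.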

"A natural question to ask is why would someone who rates movies on IMDb - often under his or her real name - care about privacy of his movie ratings?" the researchers asked.

"Consider the information that we have been able to deduce by locating one of these users’ entire movie viewing history in the Netflix dataset and that cannot be deduced from his public IMDb ratings."

Here's where the de-anonymization becomes scary:

First, we can immediately find his political orientation based on his strong opinions about "Power and Terror: Noam Chomsky in Our Times" and "Fahrenheit 9/11."

Strong guesses about his religious views can be made based on his ratings on "Jesus of Nazareth" and "The Gospel of John". He did not like "Super Size Me" at all; perhaps this implies something about his physical size?

Both items that we found with predominantly gay themes, "Bent" and "Queer as folk" were rated one star out of five. He is a cultish follower of "Mystery Science Theater 3000".

This is far from all we found about this one person, but having made our point, we will spare the reader further lurid details.

"We extracted from the Netflix Prize dataset non-public information about some subscribers that should be considered sensitive by any reasonable definition," the researchers noted, an accomplishment that should give people reason to be concerned about the true privacy they enjoy online.
