Late last week, Blekko launched the Spam Clock – the search engine’s illustration of how quickly the web is being flooded with spam. More specifically, it counts up the number of spam pages added to the web since January 1. What is not so clear from looking at it, however, is just what Blekko considers spam (though the page does remind us that spammers are out to harm users, steal publisher traffic, and defraud advertisers).
We asked Blekko just how it is defining spam for this Spam Clock, and co-founder and CEO Rich Skrenta gave us a pretty long explanation of Blekko’s philosophy on the subject. "The web has gone from 1 billion pages to 100 billion pages in the past 10 years," he tells us. "This is enormous growth. But what are all these pages? Does it make sense that the web is that big?"
Skrenta refers to a paper from some Microsoft researchers on how big the "useful" web is. It found that over a period of a year and a half, the total number of pages that searches on MSN Search ever led to was only 550 million. "Not even 1 billion," says Skrenta.
"Then they started working through the math on whether this made sense," he explains. "We tend to think of Wikipedia as a huge encyclopedia. It is – it’s thousands of times larger than Britannica ever was. But English Wikipedia is only 3.5 million articles. And the growth has slowed. Why has the growth slowed? Because people don’t want to edit Wikipedia anymore? No. It’s because it’s DONE. It already has a page about everything. Aardvarks and the Amazon and acetaminophen and Australia. The world has to make new things, people and events so Wikipedia can add pages for them."
"This is true for other categories of things people might look for," says Skrenta. "There are only 70,000 total titles in the Netflix catalog. There are about 350,000 cities in the world of note. There are about 8 million total products available through Amazon. 15 million US businesses. Millions, not billions."
"Even if you add in 550 million Facebook users and 150 [million] Twitter users, you’re still not [at] a billion," he adds. "Add in every tweet and every Facebook status update and you’re still in the low billions. So what are all these 100 billion URLs? And unlike Wikipedia, whose growth has leveled off, the web’s growth is increasing," says Skrenta. "The reason is that it cost[s] virtually nothing to make thousands or millions of new pages, and pages that catch search traffic make money."
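For readers who want to check the tally, Skrenta's figures can be added up in a few lines. This is only a rough sketch: the dictionary labels are ours, and it assumes the quoted "150 twitter users" means 150 million.

```python
# Counts quoted by Skrenta for things people might search for.
catalog = {
    "English Wikipedia articles": 3_500_000,
    "Netflix titles": 70_000,
    "notable cities": 350_000,
    "Amazon products": 8_000_000,
    "US businesses": 15_000_000,
    "Facebook users": 550_000_000,
    # Assumption: the quoted "150 twitter users" means 150 million.
    "Twitter users": 150_000_000,
}

total = sum(catalog.values())
one_billion = 1_000_000_000

print(f"Total: {total:,} items")
print(f"Below one billion: {total < one_billion}")
```

On these numbers the sum stays under a billion, which is the point Skrenta is making before he adds in tweets and status updates.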
So that would appear to be Blekko’s mindset on the state of the web, and the reason the company is betting on community to control search relevancy through the search engine’s slashtag method (more on this here).
"Our approach was to model what we expected the growth in the web to look like over the next year, and after applying a discount for the legitimate pages, to count the rest as useless garbage," says Skrenta. "More new copies of Wikipedia, more Markov-chain-generated spam blogs, more copies of the InfoUSA and Acxiom business datasets clogging the web, more fake review sites with Mechanical Turk-authored posts."
We still don’t know exactly what is being "discounted for the legitimate pages". Certainly, there’s a wide range of valuable content between the sites Skrenta mentions and what could widely be considered spam, but proportionally, is Blekko right?
"The numbers are rough, but we can boil it down to approximately 90% of the (very conservatively estimated) 10 billion pages that will be added to the web over the next year to be spam," he says. "That works out to about a million pages an hour. Frankly this is probably under-counting the spam given the growth in spam tweets and fake Facebook accounts, but it’s mainly for illustrative purposes, so the round number worked."
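Skrenta's hourly figure is easy to sanity-check. The sketch below uses only the numbers he quotes, plus our own assumption of a 365-day year; the variable names are ours, not Blekko's, and this is not necessarily how the Spam Clock itself is computed.

```python
# Figures quoted by Skrenta.
new_pages_per_year = 10_000_000_000  # "very conservatively estimated" 10 billion
spam_percent = 90                    # "approximately 90%"

# Assumption: a 365-day year.
hours_per_year = 365 * 24            # 8,760 hours

# Integer arithmetic keeps the yearly spam count exact.
spam_pages_per_year = new_pages_per_year * spam_percent // 100
spam_pages_per_hour = spam_pages_per_year / hours_per_year

print(f"{spam_pages_per_year:,} spam pages per year")
print(f"{spam_pages_per_hour:,.0f} spam pages per hour")
```

The result is a little over a million pages an hour, which is consistent with the round number he cites.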
"The massive flooding of the web with endless copies and permutations and shadows of existing things is what is pulling the rug out from under link-based search rankings," Skrenta concludes. "Links don’t represent a human voting on the quality of a site anymore."
Whether or not Blekko finds mainstream success as a search engine, this is why social has become such an important factor in search relevancy, and why the major search engines have continued to move in this direction. It’s about trust.
What do you think of Blekko’s evaluation of the web? Share your thoughts here.