Search Engine Patents and Panda

Bill Slawski is the president and founder of SEO by the Sea, and has been engaging in professional SEO and internet marketing consulting since 1996. With a Bachelor of Arts Degree in English from the University of Delaware, and a Juris Doctor Degree from Widener University School of Law, Bill worked for the highest level trial Court in Delaware for 14 years as a court manager and administrator, and as a technologist/management analyst. While working for the Court, Bill also began to build and promote web pages, and became a full time SEO in 2005. Working on a wide range of sites, from Fortune 500 to small business pages, Bill also blogs about search engine patents and white papers on his seobythesea.com blog.

What are the Most Likely Signals Used by Panda?

Eric Enge: Let’s chat about some of the patents that might be playing a role in Panda 1, 2, 3, 4, 5, 6, 7 and beyond. I would like to get your thoughts on what signals are used for measuring either content quality or user engagement.

Bill Slawski: I’ve been looking at sites impacted by Panda. I started from the beginning with remedial SEO. I went through the sites, crawled through them, looked for duplicate content issues within the same domain, looked for things that shouldn’t be indexed that were, and went through the basic list that Google provides in their Webmaster Tools area.

In the Wired interview with Amit Singhal and Matt Cutts regarding this update, they mentioned an engineer named Panda. I found his name on the list of papers written by Googlers and read through his material. I also found three other tool and systems engineers named Panda, and another engineer who writes about information retrieval and architecture. I concluded that the Panda in question was the person who worked on the PLANET paper (more on this below).

For signals regarding quality, we can look to the lists of questions from Google. For example, Does your web site read like a magazine? Would people trust you with their credit card? There are many things on a web site that might indicate quality and make the page seem more credible and trustworthy and lead the search engine to believe it was written by someone who has more expertise.

The way things tend to be presented on pages, for instance where eight blocks are shown, may or may not be signals. If we look at the PLANET whitepaper “Massively Parallel Learning of Tree Ensembles with MapReduce” its focus isn’t so much on reviewing signals with quality or even user feedback but, rather, how Google is able to take a machine learning process dealing with decision trees and scale it up to use multiple computers at the same time. They could put many things in memory and compare one page against another to see if certain features and signals appear upon those pages.

Eric Enge: So, the PLANET whitepaper described how to take a process, which before was constrained to a single-computer machine learning process, and put it into a distributed environment to gain substantially more power. Is that a fair assessment?

Bill Slawski: That would be a fair assessment. It would use the Google file system and Google’s MapReduce. It would enable them to draw many things into memory to compare to each other and change multiple variables at the same time. For example, a regression model type approach.

Something that may have been extremely hard to use on a very large dataset becomes much easier when it can scale. It’s important to think about what shows up on your web page as a signal of quality.

It’s possible that their approach is to manually identify pages that have quality in terms of content, presentation, and so on, and use those as a seed set for the machine learning process. Using that process to identify other pages, and how well they may rank in terms of these different features, makes it harder for us to determine exactly which signals the search engines are looking for.

If they are following this PLANET-type approach in Panda with the machine learning, there may be other things mixed in. It is hard to tell. Google may not have solely used this approach. They may have tightened up phrase-based indexing and made that stronger in a way that helps rank and re-rank search results.

It appears that Panda is a re-ranking approach. It’s not a replacement for relevance and PageRank and the two hundred plus signals we are used to hearing about from Google. It may be a filter on top of those where some web sites are promoted and other web sites are demoted based upon some type of quality signal score.

Eric Enge: That’s my sense of it also. Google uses the term classifier so you could imagine, either before running the basic algorithm or after, it is similar to a scale or a factor up or down.

Bill Slawski: Right. That’s what it seems like.
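One way to picture the re-ranking idea discussed above is as a quality factor applied on top of an existing relevance score. The sketch below is only an illustration under assumed names and numbers: the `Result` fields, the thresholds, and the promote/demote factors are all hypothetical, and nothing here is taken from Google.

```python
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    relevance: float  # score from the base ranking (hypothetical)
    quality: float    # classifier output between 0 and 1 (hypothetical)


def rerank(results, low=0.3, high=0.8, demote=0.5, promote=1.2):
    """Re-order results by relevance scaled up or down by a quality factor."""
    def adjusted(result):
        if result.quality < low:
            factor = demote      # low-quality pages are pushed down
        elif result.quality > high:
            factor = promote     # high-quality pages get a boost
        else:
            factor = 1.0         # everything else keeps its base score
        return result.relevance * factor

    return sorted(results, key=adjusted, reverse=True)


results = [
    Result("thin-content.example", relevance=0.90, quality=0.10),
    Result("solid-article.example", relevance=0.80, quality=0.90),
    Result("average-page.example", relevance=0.70, quality=0.50),
]
reranked = [r.url for r in rerank(results)]
```

In this toy data the page with the highest base relevance ends up last, because its low quality score halves its adjusted score.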

Page Features as an Indicator of Quality

Eric Enge: You shared another whitepaper with me which dealt with sponsored search. Does that whitepaper add any insight into Panda? The PLANET paper followed up on an earlier paper on sponsored search which covered predicting bounce rates on ads. It looked at the landing pages those ads brought you to based upon features found on the landing pages.

They used this approach to identify those features and then determined which ones were higher quality based upon their feature collection. Then they could look at user feedback, such as bounce rates, to see how well they succeeded or failed. This may lead to metrics such as the percentage of the page above the fold which has advertising on it.

Bill Slawski: Now you are talking about landing pages so many advertisers may direct someone to an actual page where they can conduct a transaction. They may bring them to an informational page, or an informational light page, that may not be as concerned with SEO as it is with calls to action, signals of reassurance using different logos, and symbols that you would get from the security statistical agencies.

That set of signals is most likely different from what you would find on a page that was built for the general public or for search engines. However, if you go back to the original PLANET page they said, “this is sort of our proof of concept, this sponsored search thing. If it works with that it can work well with other very large datasets in places like organic search.”

Eric Enge: So, you may use bounce rate directly as a ranking signal but when you have newer information to deal with why not predict it instead?

Bill Slawski: Right. If you can take a number of features out of a page and use them in a way that gives them a score, and if the score can match up with bounce rate and other user engagement signals, chances are a feature-based approach isn’t a bad one to take. Also, you can use the user behavior data as a feedback mechanism to make sure you are doing well.
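The idea of scoring a page from its features and then checking that score against user behavior can be sketched in a few lines. This is a minimal sketch with made-up data: the feature names, the weights in `feature_score`, and the bounce rates are all assumptions for illustration, and a real system would learn its weights rather than hard-code them.

```python
from math import sqrt


def feature_score(page):
    """Score a page from on-page features using hypothetical weights."""
    score = 1.0
    score -= 0.6 * page["ad_fraction_above_fold"]
    score -= 0.05 * page["broken_links"]
    score += 0.1 * page["subheadings"]
    return score


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented pages: features plus the bounce rate observed afterwards.
pages = [
    {"ad_fraction_above_fold": 0.7, "broken_links": 4, "subheadings": 0, "bounce_rate": 0.85},
    {"ad_fraction_above_fold": 0.4, "broken_links": 1, "subheadings": 1, "bounce_rate": 0.60},
    {"ad_fraction_above_fold": 0.1, "broken_links": 0, "subheadings": 3, "bounce_rate": 0.35},
    {"ad_fraction_above_fold": 0.0, "broken_links": 0, "subheadings": 4, "bounce_rate": 0.30},
]

scores = [feature_score(p) for p in pages]
bounces = [p["bounce_rate"] for p in pages]

# A strongly negative value means high feature scores line up with low bounce rates.
validation = pearson(scores, bounces)
```

Here the bounce rate never feeds into the score itself. It is only used afterwards to check whether the feature-based score is pointing in the right direction.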

Eric Enge: So, you are using the actual user data as a validator rather than a signal. That’s interesting.

Bill Slawski: Right. You could do the same thing with organic search which, to a degree, they did with the blocked pages signal. In that case, 85% of the pages that were blocked were also pages that had lower quality scores. You can also look at other signals, for example, long clicks.

Eric Enge: Long clicks, what’s that?

Bill Slawski: I dislike the term bounce rate because it, by itself, doesn’t conclusively mean that someone visits the page and then leaves within a few seconds. It can also mean that someone goes to a page, looks at it, spends time on it, and then leaves without going somewhere else. A long click is when you go to a page and you actually spend time there.

Eric Enge: Although, you don’t know whether or not they spent time there because they had to deal with a phone call.

Bill Slawski: Or, they opened something else up in a new tab and didn’t look at it for a while. There are other things that could measure this and ways to confirm agreement with it, such as how far someone scrolls that page.

Eric Enge: Or, if they print the page.

Bill Slawski: And clicks at the bottom of the page.

Eric Enge: Or clicks on some other element. Could you track cursor movements?

Bill Slawski: There have been a couple patents, even some from Google, on tracking cursor movements that they may possibly use someday. These could give them an indication of how relevant something may, or may not, be to a particular query.

One patent is described as being used on a search results page, and it shows where someone hovers for a certain amount of time. If it’s a search result, you see if they hover over a one-box result which may give them an incentive to continue showing particular types of one-box results. That’s a possibility, mouse pointer tracking.

Bounce Rates and Other User Behavior Signals

Eric Enge: Getting back to the second whitepaper, what about using the actual ad bounce rate directly as a signal because that’s also potentially validating a signal either way?

Bill Slawski: It’s not necessarily a bad idea.

Eric Enge: Or low click through rates, right?

Bill Slawski: As we said, user signals sometimes tend to be noisy. We don’t know why someone might stay on one page longer than others. We don’t know if they received a phone call, if they opened it up in a new tab, if they are showing someone else and have to wait for the person, or if there is some other reason entirely.

You could possibly collect different user behavior signals even though they may be noisy and may not be an accurate reflection of someone’s interest. You could also take another approach and use the user behavior signals as feedback. To see how your methods are working, you have the option to have a wider range of different types of data to check against each other.

Bill Slawski: That’s not a bad approach. Rather than have noisy user data be the main driver for your rankings, you find another method that looks at the way content is presented on a page. One area is segmentation of a page which identifies different sections of a page by looking at features that appear within those sections or blocks, and which area is the main content part of a page. It’s the part that uses full sentences, or sometimes sentence fragments, uses periods and commas, capital letters at the beginning of lines or text. You use a Visual Gap Segmentation (White Space) type process to identify what might be an ad, what might be navigation, where things might be such as main content areas or a footer section. You look for features in sections.

For instance, a footer section is going to contain a copyright notice and being able to segment a page like that will help you look for other signals of quality. For example, if an advertisement appears immediately after the first paragraph of the main content area you may say, “well, that’s sort of intrusive.” If one or two ads take up much of the main space, that aspect of the page may lead to a lower quality score.
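The kind of block labeling described above can be approximated with a few text heuristics. The sketch below is a rough illustration, not a description of any search engine's method: the function names, the word-count cutoff, and the rule order are all assumptions, and it works on plain text blocks rather than on a rendered page.

```python
import re


def classify_block(text, link_count=0):
    """Guess what role a block of page text plays, using simple heuristics."""
    words = text.split()
    # Footers tend to carry a copyright notice.
    if "©" in text or "copyright" in text.lower():
        return "footer"
    # Several linked words with no sentence punctuation look like navigation.
    mostly_links = link_count >= 3 and link_count >= len(words) / 2
    if mostly_links and not re.search(r"[.,]", text):
        return "navigation"
    # Main content uses sentences: a capital letter, then text, then end punctuation.
    sentences = re.findall(r"[A-Z][^.!?]*[.!?]", text)
    if sentences and len(words) >= 8:
        return "main_content"
    return "other"


def ad_interrupts_content(labels):
    """True if an ad block directly follows the first main-content block."""
    if "main_content" not in labels:
        return False
    first = labels.index("main_content")
    return first + 1 < len(labels) and labels[first + 1] == "ad"


nav = classify_block("Home Products About Contact", link_count=4)
footer = classify_block("© 2011 Example Inc. All rights reserved.")
body = classify_block(
    "Panda looks at many features of a page. Some relate to layout, and some relate to text."
)
intrusive = ad_interrupts_content(["navigation", "main_content", "ad", "main_content", "footer"])
```

Once blocks are labeled, a check like `ad_interrupts_content` becomes a simple question about the order of labels, which is the sort of feature a quality score could draw on.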

How the Search Engines Look at a Page

Eric Enge: I understand how features may impact the search engine’s perception of a page’s quality, but that presumes they can unravel the CSS to figure out where things are really appearing.

Bill Slawski: Microsoft has been writing white papers and patents on the topic of Visual Gap Segmentation since 2003. Google had a patent called “Determining semantically distinct regions of a document” involving local search where they could identify blocks of text reviews for restaurants or other places that may be separated.

For example, you have a New York Village Voice article about restaurants in Greenwich Village, and it has ten paragraphs about ten different restaurants. Each paragraph starts with the name of the restaurant and ends with the address, and in between is the review.

This patent said, “we can take that page, segment those reviews, and identify them with each of the individual restaurants,” and then, two or three paragraphs later, they say, “we can also use the segmentation process in other ways like identifying different sections of a page, main content, a header, a footer, or so on.” Google was granted a patent on a more detailed page segmentation process about a month ago.

Segmentation is probably part of this quality review, being able to identify and understand different parts of pages. They don’t just look at CSS. In the days when tables were used a lot you had the old table trick.

You moved the content up and, depending on how you arranged a table, you could use absolute positioning. With CSS you can do the same type of thing, but the search engine is going to use some type of simulated browser. It doesn’t render a page completely, but it gives them an idea of the layout if they look at the DOM (Document Object Model) of a page.

They look at some simulation of how the page will render, like an idea of where white space is, where HR tags might be throwing lines on the page, and so on. They can get a sense of what appears where, how they are separated, and then try to understand what each of those blocks does based upon linguistic-based features involving those blocks.

Is it a set of multiple single-word items that have links attached to them? If each one is capitalized, that might be main navigation. So, you can break up a page like that, and you can look at where things appear. That could be a signal, a quality signal. You can see how they are arranged.

The Search Engines Understand That There Are Different Types of Sites

Eric Enge: Does the type of site matter?

Bill Slawski: Most likely there is some categorization of types of sites so you are not looking at the same type of quality signals on the front page of a newspaper as you are on the front page of a blog or an ecommerce site.

You can have different types of things printed on those different places. You are not going to get a TRUSTe badge on a blog, but you might on an ecommerce site. You look at the different features and realize that different genres, different types of sites, may have different ones associated with them.

Eric Enge: Yes.

Bill Slawski: That may have been derived when these seed quality sites were selected. There may have been some preprocessing to identify different aspects such as ecommerce site labels, blog labels, and other things so whatever machine learning system they used could make distinctions between types of pages and see different types of features with them.

It’s called a Decision Tree Process, and this process would look at a page and say, “is this a blog, yes or no? Is this a new site, yes or no?” It crawls along different pathways and asks questions to arrive at that final score.
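A decision tree of yes/no questions like the one described here can be written down as nested nodes and walked from the top. This is a hand-built toy, assuming invented questions and scores: `TREE`, the question names, and the leaf values are hypothetical, and a system along the lines of the PLANET paper would learn many such trees from data rather than have them written by hand.

```python
# Each internal node is (question, yes_branch, no_branch); each leaf is a score.
TREE = (
    "is_blog",
    ("has_author_bio", 0.8, 0.5),
    (
        "is_ecommerce",
        ("has_trust_badge", 0.9, 0.4),
        0.6,
    ),
)


def walk(node, page):
    """Follow yes/no answers down the tree until a leaf score is reached."""
    while isinstance(node, tuple):
        question, yes_branch, no_branch = node
        # A feature missing from the page is treated as a "no" answer.
        node = yes_branch if page.get(question, False) else no_branch
    return node


blog_score = walk(TREE, {"is_blog": True, "has_author_bio": True})
shop_score = walk(TREE, {"is_ecommerce": True})
other_score = walk(TREE, {})
```

Because the first question separates blogs from everything else, a blog is never asked about trust badges and a shop is never asked about author bios, which matches the point that different types of sites have different features associated with them.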

Eric Enge: Other things you can look at are markers of quality, such as spelling errors on the page. I think Zappos, if I remember correctly, is currently editing all their reviews because they’ve learned that spelling errors and grammar affect conversion. So, that’s a clear signal they could potentially use, and the number of broken links is another.

Another area that’s interesting is when you come to a page and it is a long block of text. There may be a picture on top, but that’s probably a good predictor of a high bounce rate. If it is a research paper, that’s one thing, but if it is a news article that is something else.

Bill Slawski: Or, if it’s the Declaration of Independence.

Eric Enge: Right, but they can handle that segmentation. If someone is looking for a new pair of shoes, and they come to a page with ten paragraphs of text and a couple of buttons to buy shoes, that’s a good predictor of a high bounce rate.

Bill Slawski: On the other hand, if you have a page where there is an H1 header and a main heading at the top of the page, a couple of subheadings, a list, and some pictures that all appear to be meaningful to the content of the page, that would be a well-constructed article. It’s readable for the web, it’s easy to scan and it’s easy to locate different sections of the page that identify different concepts. This may make the page more interesting, more engaging, and keep people on a page longer.

So, do these features translate to the type of user behavior where someone will be more engaged with the page and spend more time on it? Chances are, in many cases, they will.

User Engagement Signals as a Validator

Eric Enge: Another concept is user engagement signals standing by themselves may be noisy but ten of them collectively probably won’t be noisy. You could take ten noisy signals and if eight of them point in the same direction, then you’ve got a signal.

Bill Slawski: They reinforce each other in a positive manner.

Eric Enge: Then you are beginning to get something which is no longer a noisy signal.
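The "eight out of ten" idea can be expressed as a simple agreement check across signals. The sketch below assumes each signal has already been reduced to a positive or negative vote, which is a large simplification: the `agreement` function and its 0.8 threshold are illustrative choices, not anything described in the papers discussed here.

```python
def agreement(votes, threshold=0.8):
    """Combine noisy +1/-1 votes into one verdict.

    Returns 1 if enough votes are positive, -1 if enough are negative,
    and 0 if the signals disagree too much to say anything.
    """
    if not votes:
        return 0
    positive = sum(1 for v in votes if v > 0)
    negative = sum(1 for v in votes if v < 0)
    if positive / len(votes) >= threshold:
        return 1
    if negative / len(votes) >= threshold:
        return -1
    return 0


# Ten hypothetical engagement signals for one page, eight of them positive.
mostly_positive = agreement([1, 1, 1, 1, 1, 1, 1, 1, -1, -1])
# An even split gives no usable verdict.
split = agreement([1, 1, 1, 1, 1, -1, -1, -1, -1, -1])
```

Any single vote in the list could be wrong for reasons like a phone call or an idle tab, but the combined verdict only changes when most of the signals move together.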

Bill Slawski: Right. For example, if you have a warehouse full of people, in an isolated area, printing out multiple copies of the same document over and over and over, because they think printing a document is a user behavior signal that the search engine might notice, you are wasting a lot of paper and a lot of time.

In isolation that is going to look odd, it’s going to be an unusual pattern. The search engine is going to say, “someone is trying to do something they shouldn’t be doing.”

Eric Enge: Yes. That can become a direct negative flag, and you must be careful because your competitor could do it to you. So, the ballgame seems to go on. What about misleading information which was covered by a Microsoft white paper?

Bill Slawski: That was about concepts involving web credibility that Microsoft attempted to identify. It involved both on-site factors and off-site factors, and a third category, called aggregated information, which was the user behavior data they collected about pages. If you had on-site factors such as security certificates, logos, and certain other features, that would tend to make you look more credible. The emphasis is more on credibility than quality. It seems that the search engines are equating credibility with quality to a degree.

The AIRWeb Conference, which was held five years in a row but not held last year, was held again this year. It covered adversarial information retrieval on the web in conjunction with another workshop on credibility. They called it the 2010 Web Quality Conference and it was chaired by people from Google, Microsoft, Yahoo and a number of academic participants.

You can go back a number of years to the Stanford Persuasive Technology Lab’s research and work on credibility. One of the findings, from a study of five thousand web sites or so, was that design plays an important part, maybe bigger than most people would assume, when it comes to people assessing whether or not a site is credible.

They also came out with a series of guidelines listing certain things that will make your web site appear more credible to people. These included photographs of the people behind the site, explicitly showing an address, and having a privacy policy, an ‘about us’ page, or terms of service. These are on-page signals you could look at.

There are many off-page signals you could look at such as winning a Webby Award, being recognized in other places, being cited on authoritative type sites, or even PageRank, which they said they would consider as a signal to determine whether or not a page was a quality page. In the Microsoft paper they said they will look at PageRank, which was interesting.

Populating Useful Information Among Related Web Pages

Eric Enge: Then you have the notion of brand searches. If people are searching for your brand, that’s a clear signal. If you have a no-name web site and there are no searches for the web site name or the owner’s company name.

Bill Slawski: That stirs up a whole different kettle of fish, and it leads to the question of how you determine whether or not a page is an authority page. For instance, Google decides, when somebody types ESPN into their search box on the toolbar, the ESPN web site should be the first one to come up. It doesn’t matter much what follows it. If they type Hilton... but that goes into the topic of data the search engines identify as named entities, or specific people and places. How do they then associate those with particular query terms, and if those query terms are searched for, how do they treat them?

Do they look at it as a navigational query and ensure the site they associated with it comes up? Do they imply site search and show four, five, six, seven different results from that web site in the top ten which Google had been doing for a good amount of time?

Eric Enge: Even for a non-brand search, for instance, Google surely associates Zappos with shoes. Right? So, in the presence of the authority, compared to some other new shoe site, you could reference the fact that the brand name Zappos is searched a bunch and that could be a direct authority signal for any search on the topic of shoes.

Bill Slawski: Right. Let us discuss a different patent from Google that explores that and goes into it in more detail. There was one published in 2007 that I wrote about called “Populating useful information among related web pages.” It talks about how Google determines which web site might be associated with a particular query and might be identified as authoritative of it.

In some ways, it echoes some of the things in the Microsoft paper about misinformation about authority. It not only looks at things it may see on the web, such as links to the pages using anchor text with those terms, but it may also look to see whether or not the term is a registered trademark that belongs to the company that owns a particular web site. It may also look at the domain name or yellow page entries.

One of the authors of this patent also wrote a number of the local search patents which, in some parts, say that citations are just as good as links. A business that is mentioned at a particular location will more likely rank higher if somebody does a search for businesses of that type in that location. So, this patent from Google expands beyond local search to find authoritative web pages for particular queries.

Rejecting Annoying Documents

Eric Enge: Excellent. Since we are getting towards the end I’d like your thoughts on annoying advertisements.

Bill Slawski: Google came up with a patent a few years ago which, in some ways, seems a bit similar to Panda. It focused upon features on landing pages and the aspects of advertisements. It was called “Detecting and rejecting annoying documents”.

It provided a list of the types of things they may look at in ads, on landing pages, the subject matter, characteristics rating, what type of language it uses, geographically where is it from, and who is the owner of the content.

Eric Enge: It may even detect content in images using OCR or other kinds of analysis to understand what is in an image.

Bill Slawski: Right, and also locate Flash associated with an ad, locate the audio that might be played, look at the quality of images, and the fact that they are animated or not. It was a big list. I do not know if we will see a patent anytime soon from Google that gives us the same type of list involving organic search and the Panda approach. Something might be published two, three or four years from now.

Eric Enge: It’s interesting. Obviously, what patents they are using and not using is something you don’t get visibility into unless you are in the right building at the right time at the Googleplex.

It seems to me the underlying lesson is that you need to be aware of search engines and, obviously, make search engine savvy web sites. The point is you need to focus on what people should have focused on all along which is: What do my users want? How do I give it to them? How do I engage them? How do I keep them interested? Then create a great user experience because that’s what they are trying to model.

Bill Slawski: Right. My perspective is that search engines are another visitor to your web site like anybody else. They may have different requirements. There may be some additional technical steps you have to take for your site to cater to them, but they are a visitor and they want what other visitors to your site want. They want to fulfill some type of informational or situational need. They want to find information they are looking for. They want to buy what you offer if, in the snippets that show up in search results, that’s what you do offer.

If you are a web site that’s copying everybody else and not adding anything new or meaningful, not presenting it in a way that makes it easier to read and easier to find, and there is nothing that differentiates you or sets you apart, then you are not treating potential visitors the best way you can.

When you do SEO, even in the age of Panda, you should be doing all the basics. It’s a re-ranking approach. You need to get rid of the same content with multiple different URLs, get rid of pages that are primarily keyword insertion pages where a phrase or two or three changes but the rest of everything stays the same.

When you write about something, if you are paying attention to phrase-based indexing, make sure you include related information that most people would include on that page, related terms and so on. Those basics don’t go away and they may be more important now than they were in the past.

Yes. As a searcher, as someone who helps people with web sites, and as someone who may present my own stuff on web sites, I want to know how it works. When I do a search, I want to make sure I am finding the things that are out on the web.

Bill Slawski: Those are the things I need, or want, or hope to see, and if Google can do anything to make this better, I think everybody wins. That may be more work for people putting content on the web, but the cost of sweat is fairly cheap. Get some sweat equity going and make sure your stuff is stuff people want to see, and learn about the search space as much as you can.

As a ranking signal we have relevance, we have importance and, increasingly, we have content quality.

Eric Enge: How is life for you otherwise?

Bill Slawski: I have been trying to keep things local, get more involved in my local community, and do things with the local Chamber of Commerce. I now live in an area that’s much more rural in Northwestern Virginia and some of these local business people need the help.

I am really close to DC and have been trying to work more with nonprofits. Instead of traveling, I am meeting many people locally, helping people learn more about what they can do with their web sites and that’s pretty fulfilling.

Bill Slawski: I live in horse country now; there might actually be more horses in my county than there are people.

Eric Enge: Thanks Bill!

Originally published at Ramblings About SEO