Visitor Segmentation

    November 14, 2006

Quick Note:

Disciplined Search Engine Optimization
Establishing a Measurement and Process Discipline for SEO

This coming Wednesday Paul Bruemmer and I are going to be doing a webinar sponsored by WebSideStory on SEO – talking about how to put a discipline of measurement and process around a serious search optimization effort. It’s a pretty interesting topic – because I think SEO, particularly from a measurement standpoint, is very poorly understood. We’ve been working with Paul and RedDoor on various things for a while now, and I think you’ll find his methodical approach enlightening and refreshing in a discipline that often seems mysterious and chaotic. You can register to join us at

Visitor Segmentation: Segment Methodology

The discussion of segment creation capabilities in the last post showed how widely systems differ in their visitor segmentation capabilities. Today’s topic, how segments get created, is simpler and less diverse.

Essentially, there are two main divides in segmentation methodology. The first is between systems that segment against sampled data and those that use all of the data (comprehensive). The second is whether segments are created in real time or after a delay.

These aren’t necessarily either/or decisions. Some tool sets have various components and approaches that span pretty much every combination of these alternatives – and there are some pretty good reasons why each has a place.

Let’s start with sampled data vs. comprehensive data. To me, the issues here are pretty simple. Vendors provide sampled data for one simple reason – performance. It’s much easier to deliver fast answers against sampled data than it is against comprehensive data – especially if you’re talking about a large web site with tens or hundreds of millions of requests monthly.

In a perfect world, I suppose you’d like to have near instantaneous analysis against comprehensive data. But this isn’t, of course, a perfect world. And the real question is how much you lose when you employ a sampling methodology.

On the whole, I think sampling solutions are very viable. Sampling, done correctly, can almost always provide answers that are near-enough – especially given the built-in slop factor inherent in web analysis. New users of web analytic solutions are frequently (and rightly) put off by the fact that “nothing ever ties!” Rightly or wrongly, though, you get used to some level of imperfection. Indeed, I think one of the virtues of sampling is that it puts your expectations about the data in a firmly reasonable place.

Compared to some other methods of data trimming (like dropping infrequent paths), sampling is very much to be preferred. Sampling rarely distorts the data into unrecognizable forms – whereas data trimming will frequently do just that in situations where the data has a very long tail.
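To make the contrast concrete, here is a minimal Python sketch on synthetic data. Everything in it is an assumption for illustration: the traffic shape (ten popular paths plus 5,000 rare ones), the 10% sampling rate, the five-hit trimming threshold, and the function names are all hypothetical and do not reflect how any particular vendor trims or samples. It compares the share of traffic attributed to the long tail under the full data, under trimming, and under request-level random sampling.

```python
import random
from collections import Counter

def make_requests(num_tail_paths=5000, head_paths=10, head_hits=1000, seed=7):
    """Build a synthetic long-tail request log (hypothetical traffic shape)."""
    rng = random.Random(seed)
    requests = []
    for i in range(head_paths):
        requests += [f"/head/{i}"] * head_hits          # a few very popular paths
    for i in range(num_tail_paths):
        requests += [f"/tail/{i}"] * rng.randint(1, 3)  # many rarely seen paths
    rng.shuffle(requests)
    return requests

def tail_share(counts):
    """Fraction of all counted requests that went to tail paths."""
    total = sum(counts.values())
    tail = sum(c for path, c in counts.items() if path.startswith("/tail/"))
    return tail / total

def trim(counts, min_hits):
    """Data trimming: drop every path seen fewer than min_hits times."""
    return Counter({p: c for p, c in counts.items() if c >= min_hits})

def sample(requests, rate, seed=11):
    """Request-level random sampling: keep each request with probability rate."""
    rng = random.Random(seed)
    return [r for r in requests if rng.random() < rate]

requests = make_requests()
full_counts = Counter(requests)

full_share = tail_share(full_counts)
trimmed_share = tail_share(trim(full_counts, min_hits=5))
sampled_share = tail_share(Counter(sample(requests, rate=0.10)))

print(f"tail share, full data:    {full_share:.3f}")
print(f"tail share, trimmed data: {trimmed_share:.3f}")
print(f"tail share, 10% sample:   {sampled_share:.3f}")
```

In this toy setup the tail carries roughly half of all requests. Trimming removes it entirely, while the sample keeps the tail's overall share close to the true figure even though most individual tail paths no longer appear in it.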

In addition, a great deal of customer segmentation is for purely analytic purposes – not to support management reporting. And for analytic purposes, the difference between sampled data and comprehensive data is quite often not important. This is also one of those times when it’s nice to be able to check samples against comprehensive data – to either validate conclusions or spot-check for cases where your sampled data is returning suspect answers.
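A spot-check of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the visitor records are synthetic, the segment definition ("engaged searchers"), the 20% sampling rate, the 0.03 tolerance, and every function name here are hypothetical. It samples at the visitor level by hashing the visitor ID, so a given visitor is either wholly in or wholly out of the sample, then compares a segment's conversion rate on the sample with the same figure on the comprehensive data.

```python
import hashlib
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Visitor:
    visitor_id: str
    source: str
    page_views: int
    converted: bool

def make_visitors(n=50000, seed=3):
    """Synthetic visitors; conversion odds rise with page views (an assumption)."""
    rng = random.Random(seed)
    visitors = []
    for i in range(n):
        source = rng.choice(["search", "direct", "referral"])
        page_views = rng.randint(1, 10)
        converted = rng.random() < 0.02 * page_views
        visitors.append(Visitor(f"v{i}", source, page_views, converted))
    return visitors

def in_sample(visitor_id, rate):
    """Deterministic visitor-level sampling: map the ID's hash into [0, 1)."""
    digest = hashlib.sha256(visitor_id.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 0x100000000 < rate

def is_engaged_searcher(v):
    """Hypothetical segment: arrived from search and viewed at least 3 pages."""
    return v.source == "search" and v.page_views >= 3

def conversion_rate(visitors, segment):
    """Conversion rate among segment members, or None if the segment is empty."""
    members = [v for v in visitors if segment(v)]
    if not members:
        return None
    return sum(v.converted for v in members) / len(members)

def within_tolerance(full_value, sampled_value, tolerance):
    """True when the sampled figure is within an absolute tolerance of the full one."""
    return abs(full_value - sampled_value) <= tolerance

visitors = make_visitors()
sampled_visitors = [v for v in visitors if in_sample(v.visitor_id, rate=0.20)]

full_rate = conversion_rate(visitors, is_engaged_searcher)
sampled_rate = conversion_rate(sampled_visitors, is_engaged_searcher)
sample_ok = within_tolerance(full_rate, sampled_rate, tolerance=0.03)

print(f"full: {full_rate:.4f}  sampled: {sampled_rate:.4f}  within tolerance: {sample_ok}")
```

The smaller the segment, the fewer sampled visitors fall into it and the noisier the sampled figure becomes, so narrow segments are the ones most worth checking against comprehensive data.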

Which brings us to the second main divide in segmentation methodologies – real-time vs. delayed segment creation. And it’s probably obvious that there is a deep relationship between these two issues. Real-time segmentation may be impossible without sampling – so one of the biggest potential benefits to sampling is enabling the analyst to make and report on segments without having to wait hours or days.

How big a deal is this? It’s actually pretty important. If you are using segments to support management reporting, you probably won’t care much about this. After all, if you’re going to be using a segment for the next couple of years, it doesn’t much matter if it takes a day or two to create. But most segmentation is for analytic purposes – and needs to be responsive to changing needs. What’s more, an analyst often doesn’t know if a segment is going to be useful. So if you have to wait a long time to view the results of a segment, it can make the cycle times on analysis frustratingly long. This is especially problematic if your system places caps on how many segments you can create (this is pretty common when segments are being built on a vendor’s data warehouse). We’ve more than once used up our quota of segments because of segment definition errors, mistakes in judgement and just plain wrong guesses about what might prove interesting. And believe me, it isn’t fun to tell a client you can’t finish an analysis because you can’t create the segments you NOW know you really need!

So here is the recap on segment methodology: ideally, you’d like to be able to build segments in real-time against comprehensive data. But, for analytic purposes, it’s much better to have real-time segmentation with sampled data than heavily delayed segments with comprehensive data. And, if you are doing serious analysis, it’s important to have either unlimited segmentation or a very large number of available segmentations. On the other hand, if you’re focused on segmentation for management reporting, then comprehensive data is much more important than real-time capability – and you probably won’t need as many available segmentations.

What’s more, while the perfect solution would be a single tool providing unlimited, comprehensive real-time segmentation, it isn’t much worse to have a suite of tools that offer a choice of real-time segmentation against sampled data and delayed segmentation against comprehensive data. This type of solution will still meet the needs of almost every situation quite admirably. And here’s the good news: more and more tools are supporting a rich set of visitor segmentation methods, enough to ensure that you can do what you need quite a bit of the time. That’s a big change from a few years back and is one of the real bright spots in the web analytic toolspace.




Gary Angel is the author of the SEMAngel blog – Web Analytics and Search Engine Marketing practices and perspectives from a 10-year experienced guru.