Monday, December 31, 2007

The DoubleClick bind

It's been an ongoing struggle to reconcile DoubleClick's (DFA) stats with what we get elsewhere, for example from Google Analytics, WebTrends SDC, and server log files. Most of us trust the data provided by the tag-based analytics and mistrust DoubleClick's numbers, but it would be nice to know what really causes the differences. And they are BIG differences: DoubleClick's numbers (clicks) can be 100% higher than the Google Analytics or WebTrends numbers (page views).

It becomes an issue with people who are invested in spending money on interactive advertising. Understandably, they want to take the DoubleClick numbers at face value, because that makes the clickthrough rate look a lot better.

We're trying to look into it, with a lot of help from WebTrends custom reports, server logs, SDC logs, and Google Analytics as a sort of backup.

We know this:

DFA's numbers are way higher than the tag-based data, as much as 100% higher.

DFA's numbers are very similar to the numbers in server logs.

The two tag-based reports, WebTrends SDC and Google Analytics, agree with each other but are usually only half the size of DFA stats and server logs.

When we carefully compare individual page view events between server logs and SDC (tag-based) logs, we can separate out the "extra" events that don't show up in the tagged logs. Close examination does NOT show a pattern of repeated User Agent or IP information. There's no obvious evidence of one or two entities doing a lot of clicking. No easy answer about bots or cottage-industry click fraud.
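For anyone who wants to try the same comparison, here is a minimal sketch of the idea, not the actual process we used. It assumes each log has already been parsed into dictionaries with hypothetical `ip`, `ua`, and `url` fields; real matching would probably also need timestamps. It pulls out the server-log hits that have no counterpart in the tag log, then checks whether a few IP/User Agent pairs account for most of them.

```python
from collections import Counter

def extra_events(server_hits, tag_hits):
    """Server-log hits with no matching tag-log hit (multiset difference)."""
    # Assumed record shape: {"ip": ..., "ua": ..., "url": ...}
    key = lambda h: (h["ip"], h["ua"], h["url"])
    remaining = Counter(key(h) for h in server_hits) - Counter(key(h) for h in tag_hits)
    return list(remaining.elements())

def top_sources(extras, n=5):
    """Most frequent (ip, ua) pairs among the extra hits, with their share."""
    counts = Counter((ip, ua) for ip, ua, _url in extras)
    total = sum(counts.values())
    return [(src, c, c / total) for src, c in counts.most_common(n)]
```

If one or two sources were doing most of the clicking, the top entries from `top_sources` would carry a large share of the extras. In our data they did not.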

However, we are finding one interesting thing that points to a possible explanation, or at least a corroboration of the notion that DoubleClick's numbers contain a lot of non-humans. DoubleClick hits are, as you know, marked with extra parameters. So, a given banner destination page will have, in the logs, some hits with DC marker parameters (coming from banner clicks) and other hits without those marker parameters (coming from non-banner sources). We did separate analyses of these two groups - only analyzing hits that were also first hits in a visit.

We found that if the hits had DC marker parameters, the discrepancy between server and SDC/Google logs was a huge one - in the up-to-100%-more range. But if the hits did NOT have DC marker parameters, the discrepancy was really quite small; server and SDC/Google more or less had the same numbers. In other words, the hits that appeared in server logs but did not appear in tag-based logs were mostly banner hits.
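The split described above can be sketched roughly like this. The parameter name `dcmarker` is a stand-in I made up for illustration, since the real marker parameters depend on how the campaign is set up, and the inputs are assumed to be lists of first-hit-in-visit URLs already extracted from each log.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical marker name; substitute whatever parameter your
# DoubleClick destination URLs actually carry.
DC_MARKER = "dcmarker"

def has_dc_marker(url):
    """True if the URL's query string carries the (assumed) banner marker."""
    return DC_MARKER in parse_qs(urlsplit(url).query, keep_blank_values=True)

def discrepancy_by_group(server_urls, tag_urls):
    """Percent by which server-log hits exceed tag-log hits, per group."""
    result = {}
    for label, wanted in (("banner", True), ("non_banner", False)):
        server = sum(1 for u in server_urls if has_dc_marker(u) == wanted)
        tagged = sum(1 for u in tag_urls if has_dc_marker(u) == wanted)
        # Leave the percentage undefined when the tag log has no hits.
        excess = None if tagged == 0 else 100.0 * (server - tagged) / tagged
        result[label] = {"server": server, "tag": tagged, "excess_pct": excess}
    return result
```

A result where `banner` shows an excess near 100 and `non_banner` shows one near 0 is the pattern we saw.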

We all know that there are several reasons for an event to appear in server logs but not in tag logs. Two big ones are: 1) an entity that does not execute JavaScript will not appear in tag logs (rather, the identity of the page being requested won't, although the hit will appear as a site hit), and 2) an entity that does not request images will not appear in tag logs.

It looks to us like a big proportion of clicks on banners are by entities that don't request images or don't execute JavaScript - basically, bots.
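One rough way to look for the "doesn't request images" behavior directly in server logs is sketched below. It is only a heuristic under my own assumptions: the same hypothetical `ip`, `ua`, and `url` record fields, a client identified by its IP/User Agent pair, and images recognized by file extension. A real browser with everything cached could also fetch a page without fetching images, so treat the output as a list of suspects, not a verdict.

```python
from collections import defaultdict

# Assumed set of image extensions served by the site.
IMAGE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png")

def clients_without_images(server_hits):
    """Return (ip, ua) clients that made requests but never fetched an image."""
    fetched_image = defaultdict(bool)
    for h in server_hits:
        client = (h["ip"], h["ua"])
        # Drop the query string before checking the file extension.
        path = h["url"].split("?", 1)[0].lower()
        fetched_image[client] |= path.endswith(IMAGE_SUFFIXES)
    return sorted(c for c, got in fetched_image.items() if not got)
```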

Someday, I'll talk to an insider who understands what's going on and who runs these bots and why. Could be legitimate, or not.

In the meantime, I'm continuing to say that WebTrends SDC/Google Analytics numbers for banner traffic are the ones to trust, because they are probably the human traffic. DoubleClick numbers contain vast amounts of non-human traffic.