Update: I’ve been told by James Lee at Google that “The reason the probability fluctuates is because Content Experiments is using Monte Carlo simulation to calculate the probabilities. It’s possible for there to be a slight variation because of the noise.” – so there you have it, very interesting indeed!
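To see how Monte Carlo noise alone can move the figure, here’s a minimal Python sketch. Google hasn’t said what model it uses, so the Beta(1, 1) priors, the number of draws and the example visit/conversion counts are all my own assumptions. `prob_beats_original` is a hypothetical helper, not a Google API.

```python
import random

def prob_beats_original(conv_a, visits_a, conv_b, visits_b, draws=10_000, seed=None):
    """Monte Carlo estimate of P(rate_B > rate_A).

    Assumption (not Google's documented model): independent Beta(1, 1)
    priors, so each posterior is Beta(1 + conversions, 1 + non-conversions).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Draw one plausible conversion rate for each arm from its posterior.
        rate_a = rng.betavariate(1 + conv_a, 1 + visits_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + visits_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    # Fraction of draws in which the variant beat the original.
    return wins / draws

# Same hypothetical data, different random seeds: the estimate wobbles
# slightly on each "refresh" even though visits and conversions are fixed.
for seed in range(3):
    print(f"{prob_beats_original(50, 1000, 65, 1000, seed=seed):.1%}")
```

As a rule of thumb, the standard error of a Monte Carlo estimate of a probability p from N independent draws is sqrt(p(1 − p)/N). At p ≈ 0.94 and 10,000 draws, that works out to about 0.24 percentage points, so small swings between refreshes are expected unless a lot of draws are used.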
I’m a big advocate of Google Analytics, and I’m fond of split testing – you can’t argue with the facts! But I’d recently started building my own statistical analysis for split testing (not just for websites), so when I found that the “Probability of Outperforming Original” figure in Google Analytics Experiments didn’t match mine, I got a little suspicious.
Normally I’d just assume I had made a mistake, but in this case my figures matched those from lots of online calculators (including Visual Website Optimizer’s awesome A/B testing significance calculator).
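For comparison, here’s a sketch of the kind of deterministic calculation many significance calculators perform. I’m assuming a one-sided two-proportion z-test with unpooled standard errors. Individual calculators (VWO’s included) may differ in the details, and the function names and example figures are hypothetical.

```python
import math

def normal_cdf(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def z_test_confidence(conv_a, visits_a, conv_b, visits_b):
    """One-sided two-proportion z-test: confidence that B's rate exceeds A's.

    Uses unpooled standard errors, sqrt(p * (1 - p) / n) for each arm.
    Nothing here is random, so the same inputs always give the same answer.
    """
    p_a = conv_a / visits_a
    p_b = conv_b / visits_b
    se = math.sqrt(p_a * (1 - p_a) / visits_a + p_b * (1 - p_b) / visits_b)
    z = (p_b - p_a) / se
    return normal_cdf(z)

print(f"{z_test_confidence(50, 1000, 65, 1000):.1%}")
```

Because there is no random sampling involved, refreshing a calculator like this with unchanged visits and conversions can never change its answer.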
So I started looking into it a bit more. I contacted Paras Chopra, Founder & CEO of Visual Website Optimizer, about the discrepancy, to see if he knew what method Google might be using. He kindly replied with the following:
Sorry for the late reply. I’m not sure how the new Google Content Experiments is calculating significance. But earlier (when it was GWO) they used the same statistical test. So without knowing what model they’re using and what assumptions they have made, it is very difficult to comment.
So it seemed that back when it was Google Website Optimizer, the figures would have matched up.
So after finding no information on the Google Help Centre, I decided to enlist the help of a friend of mine, Helen, who is a bit of a whizz when it comes to statistics. She couldn’t work out what they were using either!
Then yesterday I found some further information in the Google Help Centre. I hadn’t seen it there before, so I assumed it must be new:
“Our adjustments are made using a statistical technique known as sequential Bayesian decision theory.”
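Google hasn’t published its model, so the following is only a rough illustration. Under a simple Bayesian model, the probability that a variant outperforms the original can be computed exactly, with no simulation, using a known closed-form sum for two independent Beta posteriors. My assumptions here are independent Beta(1, 1) priors and no sequential stopping rule. This sketch does not implement sequential Bayesian decision theory; it only shows the kind of quantity a “probability of outperforming” figure estimates. The function names are hypothetical.

```python
import math

def log_beta(a, b):
    # log B(a, b) computed via log-gamma to avoid overflow.
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def prob_b_beats_a_exact(conv_a, visits_a, conv_b, visits_b):
    """Exact P(rate_B > rate_A) for independent Beta posteriors.

    Assumes Beta(1, 1) priors, so each arm's posterior is
    Beta(1 + conversions, 1 + non-conversions). The sum has alpha_b
    terms, so it is exact but gets slower as conversions grow.
    """
    alpha_a, beta_a = 1 + conv_a, 1 + visits_a - conv_a
    alpha_b, beta_b = 1 + conv_b, 1 + visits_b - conv_b
    total = 0.0
    for i in range(alpha_b):
        total += math.exp(
            log_beta(alpha_a + i, beta_a + beta_b)
            - math.log(beta_b + i)
            - log_beta(1 + i, beta_b)
            - log_beta(alpha_a, beta_a)
        )
    return total

print(f"{prob_b_beats_a_exact(50, 1000, 65, 1000):.1%}")
```

An exact calculation like this returns the same figure every time for the same data; only a simulation-based estimate of it would drift between page loads.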
So I sent the information to Helen along with a screenshot. Here’s the snippet of the test results I sent:
(It’s a bit blurry because I copied it from Gmail, which must have resized it.)
Then about 5 minutes later I refreshed the page, and guess what? The visit numbers hadn’t changed, the conversions hadn’t changed, but the “Probability of Outperforming Original” had changed – from 94.3% to 93.6%.
I immediately hit refresh again, thinking it was some kind of randomization – but no change. This now matched my figures exactly, so I figured they must have fixed the bug. Just in case, I hit refresh again a few minutes later, and it changed again:
OK, so it’s only around a percentage point – but the scary thing is we don’t know what is causing it. I thought maybe it was time-related, but it didn’t change on one refresh; then I thought perhaps it was a bug they were fixing, but they wouldn’t fix things on the fly like that.
Then it hit me – filters. I filter out my own IP address as well as my clients’. What if the calculation is counting the visits that the profile filters exclude (even though I don’t believe it is meant to)? That’s my current theory, anyway.
Would love to hear about this from someone at Google – until then you might want to stick to using more reliable split testing tools!
(Just refreshed again and the probability has now changed to 93.1%, with no change to visits/conversions.)