A/B tests are an excellent way to learn what works and doesn’t work for your website. But after running a test and measuring an effect, a common question is, “why did that happen?” Maybe a test that sounded like a great idea actually created a downlift and you want to know what went wrong.
At this point many people reach for “post-test segmentation”, taking the test data and splitting it into different user segments, to see if the test had an unexpected effect on some group of users.
Be wary of multiple tests!
The biggest problem with this approach is something called “multiple testing”. Most people use a statistical model with a 5% false positive rate: even if a change has no real effect, there is a 5% chance that random variation alone makes the test show a statistically significant result. (Confused? Why not watch our helpful video, “5 things to do before your next A/B test”? We explain this principle in full there.)
When we start applying this test to lots of segments of users, we have a 5% chance of seeing a false win in each segment we check. This means that if you look at 20 segments, you would expect to see one segment on average (20 x 5%) with a statistically significant effect, even in tests which do nothing.
The more segments you look at, the more false positives you are likely to encounter.
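To put numbers on this, here is a minimal sketch of the arithmetic. It assumes the segments are independent of each other, which real segments often are not (for example, "mobile" and "iOS" overlap), so treat the results as a rough guide. The function names are ours, purely for illustration.

```python
def expected_false_positives(num_segments, alpha=0.05):
    """Average number of segments that look significant when nothing changed."""
    return num_segments * alpha


def prob_at_least_one_false_positive(num_segments, alpha=0.05):
    """Chance that one or more segments look significant purely by chance.

    Assumes each segment is an independent test with false positive rate alpha.
    """
    return 1 - (1 - alpha) ** num_segments


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        expected = expected_false_positives(n)
        at_least_one = prob_at_least_one_false_positive(n)
        print(f"{n:>2} segments: {expected:.2f} expected false positives, "
              f"{at_least_one:.0%} chance of at least one")
```

With 20 segments the expected count is one false positive, and under the independence assumption the chance of seeing at least one is roughly 64%.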
Worse still, if a segment is too small, you probably don’t have enough data to see a real effect, even if one does exist. As we’ve talked about before, if you want to measure an effect in an A/B test, you need to make sure you collect enough data to have the statistical power to see it. Data only delivers insights when the sample is large enough. This is even more relevant here, as many segments represent very small groups of users.
These two effects together mean that the results of post-test segmentation are often pretty meaningless and can lead you to incorrect conclusions.
Doing it right...
Here are some things to think about if you want to look at test effects on segments of users:
1. Decide on your segments up front
Searching through lots of different types of segments after a test means you’re almost certain to see some positive results just due to random chance. It’s much better to think about what segments you’re interested in up front. For example, maybe you want to know if a test affects new and returning users differently, or if US visitors are different to European visitors. By defining the segments in advance you limit the number of tests and therefore the number of false positive results you will see.
2. Use a power calculator!
Use a power calculator to work out how many visitors you need to measure effects in the segments. For example, if you calculated that you need 100,000 visitors to run an A/B test, you’ll need around 100,000 visitors in a segment in order to measure an effect of the same size in that segment. This means that if users coming through tablets represent 20% of your traffic, you’re going to need around 5x more data to measure an effect involving tablet users. The number may be unrealistically high! In that case a different approach, such as user feedback, might do more to help you understand your results.
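As a rough illustration of what a power calculator does, the sketch below uses the standard normal-approximation formula for comparing two conversion rates. This is an assumption on our part: your calculator may use a different method, and the function names and default values (5% false positive rate, 80% power) are chosen just for the example.

```python
from statistics import NormalDist


def visitors_per_variant(baseline_rate, target_rate, alpha=0.05, power=0.80):
    """Approximate visitors needed in EACH variant for a two-sided test.

    Uses the normal approximation for a difference of two proportions.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # threshold for significance
    z_power = NormalDist().inv_cdf(power)          # margin needed for power
    variance = (baseline_rate * (1 - baseline_rate)
                + target_rate * (1 - target_rate))
    effect = target_rate - baseline_rate
    return (z_alpha + z_power) ** 2 * variance / effect ** 2


def total_traffic_for_segment(visitors_needed_in_segment, segment_share):
    """Total site traffic needed so the segment alone reaches its sample size."""
    return visitors_needed_in_segment / segment_share


if __name__ == "__main__":
    # The example from the text: 100,000 visitors needed, tablets are 20%.
    print(total_traffic_for_segment(100_000, 0.20))

    # Hypothetical conversion rates: 5.0% baseline, hoping to detect 5.5%.
    print(round(visitors_per_variant(0.05, 0.055)))
```

The first call reproduces the 5x figure above: a segment with a 20% share needs about 500,000 total visitors to collect 100,000 of its own.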
3. Get stricter with your stats
Use a stricter statistical test. If you are looking at the effect of your test on users from 10 different cities, use a test that has a 0.5% false positive rate, rather than 5%. Then you will have roughly a 5% chance (10 cities x 0.5% chance of a false positive in each one) of seeing at least one false positive in the experiment overall. This will keep your overall false positive rate under control.
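Dividing the false positive rate by the number of segments is known as a Bonferroni correction. Here is a minimal sketch of how it could be applied; the city names and p-values are made up for illustration and are not real test results.

```python
def bonferroni_threshold(overall_alpha, num_segments):
    """Per-segment significance threshold that keeps the overall rate in check."""
    return overall_alpha / num_segments


def overall_false_positive_rate(per_segment_alpha, num_segments):
    """Chance of at least one false positive, assuming independent segments."""
    return 1 - (1 - per_segment_alpha) ** num_segments


def significant_segments(p_values, overall_alpha=0.05):
    """Return the segments whose p-value beats the corrected threshold."""
    threshold = bonferroni_threshold(overall_alpha, len(p_values))
    return [name for name, p in p_values.items() if p < threshold]


if __name__ == "__main__":
    threshold = bonferroni_threshold(0.05, 10)
    print(f"per-city threshold: {threshold:.3f}")
    print(f"overall rate: {overall_false_positive_rate(threshold, 10):.3f}")

    # Hypothetical p-values for three cities (illustrative only).
    example = {"London": 0.030, "Paris": 0.004, "Berlin": 0.400}
    print(significant_segments(example))
```

Note that a p-value of 0.03 would count as a win in a single test at the usual 5% level, but it does not pass once the threshold is tightened to account for the number of segments checked.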
Phew… you made it!
In principle, breaking test results into segments offers insight into why tests have the effect they do. However, doing it properly requires significantly more data and stricter statistical tests than a standard A/B test. Many testing tools ignore these facts and produce highly misleading results. If you want to understand your user segments in detail, think about designing experiments explicitly for this purpose.
If you're interested in the statistical traps that you could fall into while A/B testing, why not read our whitepaper "Most winning A/B tests are illusory"?