I’ve spent the last 6 years of my life heavily involved in A/B testing and other testing methodologies, whether that meant measuring the performance of an email campaign to drive health outcomes, product changes, or website changes; the list goes on. A few of these tests have been full factorial MVT tests (my fave).
I wanted to share some testing best practices and examples in marketing, so that you can feel confident about how you’re designing and thinking about A/B testing.
As a Data Scientist, you may be expected to be the subject matter expert on how to test correctly. Or it may be that you’ve just built a product recommendation engine (or some other model), and you want to see how much better you’re performing compared to the previously used model or business logic, so you’ll test the new model vs. whatever is currently in production.
There is SO MUCH more to the world of testing than is contained here, but what I’m looking to cover here is:
- Determining test and control populations
- Scoping the test ahead of launch
- A test design that will allow us to read the results we’re hoping to measure
- Test Analysis
- Thoughts on automating test analysis
Choosing Test and Control Populations
This is where the magic starts. The only way to determine a causal relationship is by having randomized populations (and a correct test design). So it’s imperative that our populations are drawn correctly if we want to learn anything from our A/B test. In general, the population you want to target will be specific to what you’re testing. If this is a site test for an e-commerce company, you hope that visitors are randomized to test and control upon visiting the website. If you’re running an email campaign or some other type of test, then you’ll pull all of the relevant customers/people who meet the criteria for being involved in your A/B test from a database or big data environment.
If this is a large list you’ll probably want to take a random sample of customers over some time period. This is called a simple random sample. A simple random sample is a subset of your population, where every member had an equal probability of being chosen to be in the sample.
Here is a great example on how to pull a random sample from Hive: here
Also, just to be clear, writing a “select top 1000 * from table” in SQL is NOT A RANDOM SAMPLE. There are a couple different ways to get a random sample in SQL, but how to do it will depend on the “flavor” of SQL you’re using.
Here is an example of pulling a random sample in SQL Server: here
Now that you have your sample, you’ll randomly assign these people to test and control groups.
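The two steps above (draw a simple random sample, then randomly assign it to test and control) can be sketched in a few lines of Python. This is a minimal illustration, not a production pull: the function name `sample_and_assign`, the customer IDs, the sample size, and the 50/50 split are all assumptions I made up for the example.

```python
import random

def sample_and_assign(customer_ids, sample_size, seed=42):
    """Draw a simple random sample, then split it into test and control."""
    rng = random.Random(seed)  # fixed seed so the pull can be reproduced
    # Every customer has the same probability of being selected.
    sample = rng.sample(list(customer_ids), sample_size)
    # rng.sample returns items in random selection order, so cutting the
    # list in half is itself a random assignment to the two groups.
    half = sample_size // 2
    return {"test": sample[:half], "control": sample[half:]}

# Hypothetical population of 100,000 customer IDs.
groups = sample_and_assign(range(1, 100_001), sample_size=1000)
print(len(groups["test"]), len(groups["control"]))
```

Fixing the seed is a design choice: it lets someone else re-create exactly who was in each group when the test is analyzed later.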
There are times when we’ll need to be a little more sophisticated.
Let’s say that the marketing team wants to learn about its ability to drive engagement by industry (and that you have industry data). Some of the industries are probably going to contain fewer members than others. That means that if you just split a portion of your population into two groups, you might not have a large enough sample size in certain industries that you care about to determine statistical significance.
Rather than putting in all the effort of running the A/B test only to find out that you can’t learn about an industry you care about, use stratified sampling. (This would involve doing a simple random sample within each group of interest.)
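Here is a rough sketch of what stratified sampling by industry could look like in Python. The customer records, the industry names, and the `stratified_sample` helper are hypothetical; the point is only that the random draw happens within each industry rather than across the whole list.

```python
import random
from collections import defaultdict

def stratified_sample(customers, strata_key, n_per_stratum, seed=7):
    """Take a simple random sample of up to n_per_stratum within each stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for customer in customers:
        by_stratum[customer[strata_key]].append(customer)
    sample = {}
    for stratum, members in by_stratum.items():
        # A small industry may have fewer members than we asked for.
        k = min(n_per_stratum, len(members))
        sample[stratum] = rng.sample(members, k)
    return sample

# Made-up data: one large industry and one small one.
customers = (
    [{"id": i, "industry": "retail"} for i in range(5000)]
    + [{"id": 5000 + i, "industry": "aerospace"} for i in range(300)]
)
sample = stratified_sample(customers, "industry", n_per_stratum=200)
print({industry: len(rows) for industry, rows in sample.items()})
```

With a plain random sample of 400 people, the small industry would contribute only a couple dozen members on average; sampling within each industry guarantees the count you scoped for.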
Scoping Ahead of Launch
I’ve seen cases in practice where the marketing team, not seeing the results they want, says “We’re going to let this A/B test run for two more weeks to see what happens.” Especially for site tests, if you run anything long enough, tiny effect sizes can become statistically significant. Before you launch, you should have an idea of how much traffic you’re getting to the particular webpage and how long the A/B test should run. Otherwise, what is to stop us from just running the A/B test until we get the result that we want?
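To make "how long should the test run" concrete before launch, here is a minimal sketch that uses the standard normal-approximation sample size formula for comparing two conversion rates. The baseline rate, the minimum lift worth detecting, the 5% significance level, the 80% power, and the daily traffic figure are all assumptions you would replace with numbers agreed with your stakeholders.

```python
import math
from statistics import NormalDist

def visitors_per_cell(baseline_rate, min_detectable_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per cell for a two-sided test of two proportions."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_detectable_lift)  # relative lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

def days_to_run(n_per_cell, cells, daily_visitors):
    """Days needed if daily traffic is split evenly across all cells."""
    return math.ceil(n_per_cell * cells / daily_visitors)

# Hypothetical inputs: 5% baseline conversion, 10% relative lift, 4,000 visitors a day.
n = visitors_per_cell(baseline_rate=0.05, min_detectable_lift=0.10)
print(n, days_to_run(n, cells=2, daily_visitors=4000))
```

Agreeing on these inputs ahead of time gives you a planned end date, which is what stops the "let it run two more weeks" conversation.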
Sit down with marketing and other stakeholders before the launch of the A/B test to understand the business implications, what they’re hoping to learn, who they’re testing, and how they’re testing. In my experience, everyone is set up for success when you’re viewed as a thought partner in helping to construct the test design, and have agreed upon the scope of the analysis ahead of launch.
For each cell in an A/B test, you can only make ONE change. For instance, if we have:
- Cell A: $15 price point
- Cell B: $25 price point
- Cell C: UI change and a $30 price point
You just lost valuable information. Adding a UI change AND a different price option makes it impossible to parse out what effect was due to the UI change or the $30 price point. We’ll only know how that cell performed in aggregate.
Iterative A/B testing is when you take the winner from one test and make it the control for a subsequent A/B test. This method is going to result in a loss of information. What if the combination of the loser from test 1 and the winner from test 2 is actually the winner? We’d never know!
Sometimes iterating like this makes sense (maybe you don’t have enough traffic for more test cells), but we’d want to talk about all potential concessions ahead of time.
Another type of test design is MVT (Multivariate). Here we’ll look at a full-factorial MVT. There are more types of multivariate tests, but full-factorial is the easiest to analyze.
- MVT is better for more subtle optimizations (A/B testing should be used if you think the test will have a huge impact)
- Rule of thumb is at least 100,000 unique visitors per month.
- You’ll need to know how to use ANOVA to analyze (I will provide a follow-up article with code and explanation for how to do this analysis and link it here later)
One illustrative example of an MVT test is below. On the left (below) are the control experiences, and on the right are the 3 test treatments. Each of the 3 elements is shown as either its control or its test version, which results in 2^3 = 8 treatments, because we’ll look at each possible combination of test and control.
On the left: The controls would be the current experience.
On the right: Cell A could be new photography (ex: friendly waving stick figure), Cell B could reference a sale, and Cell C could show new content.
We can learn about all the interactions! Understanding the interactions and finding the optimal treatment when changing multiple items is the big benefit of MVT testing. The chart below shows you how each person would be assigned to one of the 8 treatments in this example.
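The 8 treatments in a 2^3 full-factorial design can also be enumerated in code. The sketch below is only illustrative: the factor names are my shorthand for the three cells in the example, and `assign_treatment` with its seeded per-visitor draw is a hypothetical way to assign visitors, not how any particular testing tool does it.

```python
import random
from itertools import product

# Shorthand names for the three elements being tested (illustrative only).
factors = ["photography", "sale_message", "content"]

# Every combination of control/test across the three factors: 2^3 = 8 treatments.
treatments = [
    dict(zip(factors, levels))
    for levels in product(["control", "test"], repeat=len(factors))
]

def assign_treatment(visitor_id, seed="mvt-demo"):
    """Return a treatment index; the same visitor always gets the same one."""
    rng = random.Random(f"{seed}:{visitor_id}")
    return rng.randrange(len(treatments))

for index, treatment in enumerate(treatments):
    print(index, treatment)
print("visitor 12345 sees treatment", assign_treatment(12345))
```

The first treatment in the list is control for every element (the current experience) and the last is test for every element; the six in between are the mixed combinations that let you read the interactions.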
In a future article I’ll write up one of my previous MVT tests that I’ve analyzed, with R code.
A/B Test Analysis
One of the most important parts of test analysis is to have consistency across the business in how we analyze tests. You don’t want to claim that something had a causal effect when another person analyzing the same test might have reached a different conclusion. In addition to having consistent ways of determining conclusions, you’ll also want to have a consistent way of communicating these results with the rest of the business. For example, “Do we share results we find with a p-value greater than 0.05?” Maybe we do, maybe we don’t, but make sure the whole team is being consistent in their communication with marketing and other teams.
Confidence intervals should always be given! You don’t want to say “Wow! This is worth $700k a year”, when really it’s worth somewhere between $100k and $1.3m. That’s a big difference and could have an impact on the decision of whether to roll out the change or not.
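As a minimal sketch of reporting a range instead of a single number, here is a normal-approximation (Wald) confidence interval for the difference in conversion rate between test and control. The conversion counts are made up, and for small samples or very low rates you would want a different interval method.

```python
import math
from statistics import NormalDist

def lift_confidence_interval(conv_test, n_test, conv_control, n_control, confidence=0.95):
    """Return (low, point estimate, high) for test rate minus control rate."""
    p_test = conv_test / n_test
    p_control = conv_control / n_control
    diff = p_test - p_control
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(
        p_test * (1 - p_test) / n_test + p_control * (1 - p_control) / n_control
    )
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff - z * se, diff, diff + z * se

# Hypothetical results: 600 of 10,000 converted in test, 500 of 10,000 in control.
low, point, high = lift_confidence_interval(600, 10_000, 500, 10_000)
print(f"lift: {point:.4f} (95% CI {low:.4f} to {high:.4f})")
```

Multiplying the low and high ends by annual traffic and order value is what turns "worth $700k" into an honest range for the business.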
Let’s Automate our A/B Test Analysis!
Why spend multiple hours analyzing each A/B test, when with a couple of data entries and button pushes we can:
- Automate removal of outliers
- Skip calculating statistical significance when the sample is not yet large enough
- Determine statistical significance of metrics with confidence intervals and engaging graphs
- See how A/B tests are performing soon after launch, to make sure there aren’t any bugs messing with our results or large drops in revenue
- Reduce the opportunity for error in analysis
This would take a while to build, and it will not be one-size-fits-all for all of your tests. Automating even a portion could greatly reduce the amount of time spent analyzing tests!
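To give a feel for what even a small piece of that automation might look like, here is a rough sketch covering three of the ideas above: a minimum sample guard, outlier removal, and a confidence interval on the difference in means. Everything in it is an assumption for illustration: the `analyze` function, the 1,000-per-cell minimum, the pooled 99th percentile cutoff, and the simulated revenue data.

```python
import math
import random
from statistics import NormalDist, mean, quantiles, variance

def analyze(test_values, control_values, min_n=1000, confidence=0.95):
    """Compare a revenue-style metric between cells, with basic guardrails."""
    # Guardrail: do not report significance on a sample that is too small.
    if min(len(test_values), len(control_values)) < min_n:
        return {"status": "insufficient sample"}
    # Outlier rule (an assumption): drop values above the pooled 99th
    # percentile so both cells are trimmed by the same cutoff.
    cutoff = quantiles(list(test_values) + list(control_values), n=100)[98]
    test = [v for v in test_values if v <= cutoff]
    control = [v for v in control_values if v <= cutoff]
    diff = mean(test) - mean(control)
    se = math.sqrt(variance(test) / len(test) + variance(control) / len(control))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    low, high = diff - z * se, diff + z * se
    return {
        "status": "ok",
        "diff": diff,
        "ci": (low, high),
        "significant": not (low <= 0 <= high),
    }

# Simulated order values, purely for demonstration.
rng = random.Random(0)
test_orders = [rng.gauss(12, 2) for _ in range(2000)]
control_orders = [rng.gauss(10, 2) for _ in range(2000)]
result = analyze(test_orders, control_orders)
print(result)
```

Whatever rules you pick for outliers and minimum sample, encoding them once is what keeps the analysis consistent from one analyst to the next.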
I hope this article gave you some things to be on the lookout for when testing. If you’re still in school to become a Data Scientist, taking a general statistics class that covers which statistics to use and how to calculate confidence intervals is something that will benefit you throughout your career in Data Science. Otherwise, there are certainly tons of resources on the internet to give you an overview of how to calculate these statistics. I personally prefer Coursera, because it’s nice to sit back and watch videos on the content, knowing that the content is from well-known universities.
You can learn a ton through properly executed testing. Happy learning!