Whether you’re an experienced entrepreneur or just getting started, there’s a good chance you’ve seen countless articles and resources about A/B testing. You might even already A/B test your email subject lines or your social media posts.
Despite the fact that there’s been plenty said about A/B testing in the field of marketing, a lot of people still get it wrong. The result? People making major business decisions based on inaccurate results from an improper test.
A/B testing is often oversimplified, especially in content written for store owners. Below you’ll find everything you need to know to get started with different types of A/B testing for ecommerce, explained as plainly as possible. A/B testing can be a game changer for choosing the right product positioning, increasing conversions on a landing page, and so much more.
What is A/B testing?
A/B testing, sometimes referred to as split testing, is the process of comparing two versions of the same web page, email, or other digital asset to determine which one performs better based on user behavior. It’s a useful tool for improving the performance of a marketing campaign and better understanding what converts your target audience.
This process allows you to answer important business questions, helps you generate more revenue from the traffic you already have, and sets the foundation for a data-informed marketing strategy.
How A/B testing works
When using A/B testing in the context of marketing, you show 50% of visitors version A of your asset (let’s call this the “control”), and 50% of visitors version B (let’s call this the “variant”).
The version that results in the highest conversion rate wins. For example, let’s say the variant (version B) yielded the highest conversion rate. You would then declare it the winner and push 100% of visitors to the variant.
Then, the variant becomes the new control, and you must design a new variant.
It’s worth mentioning that an A/B test conversion rate can often be an imperfect measure of success.
For example, if one page prices an item at $50 and the other offers it completely free, the free page will likely convert better, but that’s not going to provide any truly valuable insight. As with any tool or strategy you use for your business, A/B testing has to be strategic.
That’s why you should track the value of a conversion all the way through to the final sale.
What’s A/B/n testing?
With A/B/n testing, you can test more than one variant against the control. So, instead of showing 50% of visitors the control and 50% of visitors the variant, you might show 25% of visitors the control, 25% the first variant, 25% the second variant, and 25% the third variant.
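To make the even splits described above concrete (50/50 for a standard A/B test, 25% each for a four-way A/B/n test), here is a minimal sketch of how a visitor could be assigned to a version. It is an illustration under stated assumptions, not how any particular testing tool works: the `assign_variant` function, the visitor ID, and the experiment name are all hypothetical.

```python
import hashlib

def assign_variant(visitor_id: str, experiment: str, variants: list[str]) -> str:
    """Deterministically place a visitor into one of several equally sized groups."""
    # Hash the experiment name together with the visitor ID so the same visitor
    # always sees the same version, and separate experiments split independently.
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# Standard A/B test: roughly half of visitors land in each group.
print(assign_variant("visitor-123", "free-shipping-banner", ["control", "variant"]))

# A/B/n test: roughly a quarter of visitors land in each of four groups.
abn_versions = ["control", "variant_1", "variant_2", "variant_3"]
print(assign_variant("visitor-123", "free-shipping-banner", abn_versions))
```

Hashing instead of picking at random on every page view keeps a returning visitor in the same group, as long as the ID you hash stays the same.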
Note: This is different from multivariate testing, which also involves multiple variants. When running multivariate testing, you’re not only testing multiple variants, you’re testing multiple elements at the same time. The goal is to figure out which combination performs best.
You’ll need a lot of traffic to run multivariate tests, so you can ignore those for now.
How long should A/B tests run?
Run your A/B test for at least one, ideally two, full business cycles. Don’t stop your test just because you’ve reached significance. You’ll also need to meet your predetermined sample size. Finally, don’t forget to run all tests in full-week increments.
Why two full business cycles? For starters:
- You can account for “I need to think about it” buyers.
- You can account for all of the different traffic sources (Facebook, email newsletter, organic search, etc.)
- You can account for anomalies. For example, your Friday email newsletter.
Two business cycles is generally enough time to get valuable insight into user behavior of your target audience.
If you’ve used any sort of A/B testing tool for landing pages, you’re likely familiar with the little green “Statistically Significant” icon.
For many, unfortunately, that’s the universal sign for “the test is cooked, call it.” As you’ll learn below, reaching statistical significance does not mean you should stop the test.
And your predetermined sample size? It’s not as intimidating as it seems. Open up a sample size calculator, like the one from Evan Miller, and enter your current conversion rate and the smallest effect you want to be able to detect.
For example, if your current conversion rate is 5% and you want to be able to detect a 15% relative effect, you need a sample of 13,533 per variation. So, in total, about 27,000 visitors are needed if it’s a standard A/B test.
Watch what happens if you want to detect a smaller effect:
All that’s changed is the minimum detectable effect (MDE). It’s decreased from 15% to 8%. In this case, you need a sample of 47,127 per variation. So, in total, nearly 100,000 visitors are needed if it’s a standard A/B test.
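If you’re curious where numbers like these come from, the sketch below uses a common normal-approximation formula for comparing two conversion rates. It assumes a two-sided 5% significance level and 80% power, and treats the effect as relative to the baseline. Under those assumptions it lands within a visitor of the figures above, but it is an approximation for illustration, not necessarily the exact method any given calculator uses. The function name is hypothetical.

```python
from statistics import NormalDist

def sample_size_per_variation(baseline: float, relative_mde: float,
                              alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate visitors needed per variation to detect a relative lift."""
    delta = baseline * relative_mde            # absolute difference to detect
    variant = baseline + delta                 # conversion rate if the lift is real
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Spread of the difference assuming no effect, and assuming the effect exists.
    sd_null = (2 * baseline * (1 - baseline)) ** 0.5
    sd_alt = (baseline * (1 - baseline) + variant * (1 - variant)) ** 0.5
    return round(((z_alpha * sd_null + z_power * sd_alt) / delta) ** 2)

print(sample_size_per_variation(0.05, 0.15))  # 5% baseline, 15% relative effect
print(sample_size_per_variation(0.05, 0.08))  # 5% baseline, 8% relative effect
```

Notice that the required sample grows with the inverse square of the effect: roughly halving the MDE more than triples the traffic you need.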
Whatever you’re testing, your sample size should be calculated upfront, before your test starts. Your test can’t stop, even if it reaches significance, until the predetermined sample size is reached. If it does, the test isn’t valid.
This is why you can’t aimlessly follow best practices, like “stop after 100 conversions.”
It’s also important to run a split test for full-week increments. Your traffic can change based on the day of the week and the time of day, so you’ll want to be sure to include every day of the week.
Why should you A/B test?
Let’s say you spend $100 on Facebook ads to send 10 people to your site. Your average order value is $25. Eight of those visitors leave without buying anything and the other two spend $25 each. The result? You lost $50.
Now let’s say you spend $100 on Facebook ads to send 10 people to your site. Your average order value is still $25. This time, though, only five of those visitors leave without buying anything and the other five spend $25 each. The result? You made $25.
This is one of the simpler A/B testing examples, of course. But by increasing the conversion rate for your online store, you made the same traffic more valuable.
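The arithmetic behind those two scenarios can be written out in a few lines. This is only a sketch: `campaign_profit` is a hypothetical helper, and like the example above it counts revenue minus ad spend and ignores product and shipping costs.

```python
def campaign_profit(ad_spend: float, visitors: int, buyers: int,
                    average_order_value: float) -> dict:
    """Revenue minus ad spend for a single paid campaign."""
    revenue = buyers * average_order_value
    return {
        "conversion_rate": buyers / visitors,
        "revenue": revenue,
        "profit": revenue - ad_spend,
    }

before = campaign_profit(ad_spend=100, visitors=10, buyers=2, average_order_value=25)
after = campaign_profit(ad_spend=100, visitors=10, buyers=5, average_order_value=25)

print(before["profit"])  # -50: the campaign lost $50
print(after["profit"])   # 25: same spend and traffic, now $25 ahead
```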
A/B testing images and copy also helps you uncover insights, whether your test wins or loses. This value is very transferable. For example, a copywriting insight from a product description A/B test could help inform your value proposition, a product video, or other product descriptions.
You also can’t ignore the inherent value of focusing on continuously improving the effectiveness of your online store.
Should you be A/B testing?
Not necessarily. If you’re running a low-traffic site or a web or mobile app, A/B testing is probably not the best optimization effort for you. You will likely see a higher return on investment (ROI) from conducting user testing or talking to your customers, for example.
Despite popular belief, conversion rate optimization does not begin and end with testing.
Consider the numbers from the sample size calculator above: you need 47,127 visitors per variation to detect an 8% effect if your baseline conversion rate is 5%. Let’s say you want to test a product page. Does it receive nearly 100,000 visitors in two to four weeks?
Why two to four weeks? Remember, you want to run tests for at least one, ideally two, full business cycles. Usually, that works out to two to four weeks. Now maybe you’re thinking, “No problem, I’ll run the test for longer than two to four weeks to reach the required sample size.” That won’t work either.
The longer a test is running, the more susceptible it is to external validity threats and sample pollution. For example, visitors might delete their cookies and end up re-entered into the A/B test as a new visitor. Or someone could switch from their mobile phone to desktop and see an alternate variation.
Essentially, letting your test run for too long is as bad as not letting it run long enough.
Testing is worth the investment for stores that can meet the required sample size in two to four weeks. Stores that can’t should consider other forms of optimization until their traffic increases.
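A quick way to check which camp your store falls into is to divide the required sample by the traffic the tested page actually gets. The sketch below is a rough planning aid under assumptions of my own: the function names are hypothetical, the two-week minimum and four-week maximum come from the guidance above, and the weekly visitor counts are made up.

```python
import math

def planned_weeks(sample_per_variation: int, variations: int,
                  weekly_visitors: int, min_weeks: int = 2) -> int:
    """Full weeks needed to reach the sample, never shorter than min_weeks."""
    total_needed = sample_per_variation * variations
    return max(min_weeks, math.ceil(total_needed / weekly_visitors))

def can_test(sample_per_variation: int, variations: int,
             weekly_visitors: int, max_weeks: int = 4) -> bool:
    """True when the full sample can be collected within max_weeks."""
    return planned_weeks(sample_per_variation, variations, weekly_visitors) <= max_weeks

print(planned_weeks(47_127, 2, 30_000))  # 4
print(can_test(47_127, 2, 30_000))       # True
print(can_test(47_127, 2, 5_000))        # False: it would take 19 weeks
```

Rounding up to whole weeks keeps the test in full-week increments, so every day of the week is represented equally.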
Julia Starostenko, Product Manager at Pinterest, agrees.
What should you A/B test?
I can’t tell you what you should A/B test. I know, I know. It would certainly make your life easier if I could give you a list of 99 things to test right now. There’s no shortage of marketers willing to do that in exchange for clicks.
Truth is, the only tests worth running are tests based on your own data. I don’t have access to your data, your customers, etc., and neither does anyone curating those huge lists of A/B test ideas. None of us can meaningfully tell you what to test.
Instead, I encourage you to answer this question for yourself through qualitative and quantitative analysis. Some popular types of research are:
- Technical analysis. Does your store load properly and quickly on every browser? On every device? You might have a shiny new iPhone 14, but someone somewhere is still rocking a Motorola Razr from 2005. If your site doesn’t work properly and quickly, it definitely doesn’t convert as well as it could.
- On-site surveys. These pop up as your store’s visitors browse around. For example, an on-site survey might ask visitors who have been on the same page for a while if there’s anything holding them back from making a purchase today. If so, what is it? You can use this qualitative data to improve your copy and conversion rate.
- Customer interviews. Nothing can replace getting on the phone and talking to your customers. Why did they choose your store over competing stores? What problem were they trying to solve when they arrived on your site? There are a million questions you could ask to get to the heart of who your customers are and why they really buy from you.
- Customer surveys. Customer surveys are full-length surveys that go out to people who have already made a purchase (as opposed to visitors). When designing a survey, you want to focus on: defining your customers, defining their problems, defining hesitations they had prior to purchasing, and identifying words and phrases they use to describe your store.
- Analytics analysis. Are your analytics tools tracking and reporting your data properly? That might sound silly, but you’d be surprised by how many analytics tools are configured incorrectly. Analytics analysis is all about figuring out how your visitors behave. For example, you might focus on the funnel. Where are your biggest conversion funnel leaks? In other words, where are most people dropping out of your funnel? That’s a good place to start testing.
- User testing. This is where you watch real people in a paid, controlled experiment try to perform tasks on your site. For example, you might ask them to find a video game in the $40 to $60 range and add it to their cart. While they’re performing these tasks, they narrate their thoughts and actions out loud.
- Session replays. Session replays are similar to user testing, but now you’re dealing with real people with real money and real intent to buy. You’ll watch as your actual visitors navigate your site. What do they have trouble finding? Where do they get frustrated? Where do they seem confused?
There are additional types of research as well, but start by choosing the methods that suit your store best. If you run through some of them, you will have a huge laundry list of data-informed ideas worth testing. I guarantee your list will bring you more value than any “99 things to test right now” article ever could.
Prioritizing A/B test ideas
A huge list of A/B test ideas is exciting, but not exactly helpful for deciding what to test. Where do you start? That’s where prioritization comes in.
There are a few common prioritization frameworks you can use:
- ICE. ICE stands for impact, confidence, and ease. Each of those factors receives a 1–10 ranking. For example, if you could easily run the test by yourself without help from a developer or designer, you might give ease an eight. You’re using your judgment here, and if you have more than one person running tests, rankings may become too subjective. It helps to have a set of guidelines to keep everyone objective.
- PIE. PIE stands for potential, importance, and ease. Again, each factor receives a 1–10 ranking. For example, if the test will reach 90% of your traffic, you might give importance an eight. PIE is as subjective as ICE, so guidelines can be helpful for this framework as well.
- PXL. PXL is the prioritization framework from CXL. It’s a little bit different and more customizable, forcing more objective decisions. Instead of three factors, you’ll find Yes/No questions and an ease-of-implementation question. For example, the framework might ask: “Is the test designed to increase motivation?” If yes, it gets a 1. If no, it gets a 0. You can learn more about this framework and download a spreadsheet.
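As a small illustration of how a framework like ICE turns a list of ideas into a ranked backlog, here is a minimal sketch. The test ideas and their scores are invented, and averaging the three factors is an assumption on my part (some teams multiply them instead).

```python
from dataclasses import dataclass

@dataclass
class Idea:
    name: str
    impact: int      # 1-10
    confidence: int  # 1-10
    ease: int        # 1-10

    @property
    def ice_score(self) -> float:
        # Assumption: a simple average of the three factors.
        return (self.impact + self.confidence + self.ease) / 3

ideas = [
    Idea("Shorter checkout form", impact=8, confidence=6, ease=4),
    Idea("Rewrite product description", impact=5, confidence=7, ease=8),
    Idea("New hero image", impact=4, confidence=3, ease=9),
]

ranked = sorted(ideas, key=lambda idea: idea.ice_score, reverse=True)
for idea in ranked:
    print(f"{idea.ice_score:.1f}  {idea.name}")
```

The same structure works for PIE: rename the fields to potential, importance, and ease.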
Now you have an idea of where to start, but it can also help to categorize your ideas. For example, during some conversion research I did recently, I used three categories: implement, investigate, and test.
- Implement. Just do it. It’s broken or obvious.
- Investigate. Requires extra thought to define the problem or narrow in on a solution.
- Test. The idea is sound and data informed. Test it!
Between this categorization and prioritization, you’re set.
A crash course in A/B testing statistics
Before you run a test, it’s important to dig into statistics. I know, statistics usually aren’t a fan favorite, but think of this as the required course you begrudgingly take to graduate.
Statistics is a big part of A/B testing. Fortunately, A/B testing tools and split testing software have made the job of an optimizer easier, but a basic understanding of what’s happening behind the scenes is crucial for analyzing your test results later on.
What is mean?
Mean is the average. Your goal is to find a mean that is representative of the whole.
For example, let’s say you’re trying to find the average price of video games. You’re not going to add the price of every video game in the world and divide it by the number of all the video games in the world. Instead, you’ll isolate a small sample that is representative of all of the video games in the world.
You might end up finding the average price of a couple hundred video games. If you’ve selected a representative sample, the mean price of those two hundred video games should be representative of all the video games in the world.
What is sampling?
The larger the sample size, the less variability there will be, which means the mean is more likely to be accurate.
So, if you increased your sample from 200 video games to 2,000 video games, your sample mean would vary less from one sample to the next, giving you a more precise mean.
What is variance?
Variance measures how spread out the individual data points are around the mean. Essentially, the higher the variability, the less accurate the mean will be in predicting an individual data point.
So, how close is the mean to the actual price of each individual video game?
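To see mean, sampling, and variance working together, the sketch below draws two samples from a made-up population of video game prices. Everything here is hypothetical: the prices are random numbers between $5 and $70, and `summarize` is just an illustrative helper.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical "population" of video game prices, in dollars.
population = [round(random.uniform(5, 70), 2) for _ in range(50_000)]

def summarize(sample: list[float]) -> dict:
    mean = statistics.mean(sample)
    variance = statistics.variance(sample)  # spread of individual prices
    # Standard error: how much the sample mean itself tends to move around.
    standard_error = (variance / len(sample)) ** 0.5
    return {"n": len(sample), "mean": mean, "variance": variance,
            "standard_error": standard_error}

small = summarize(random.sample(population, 200))
large = summarize(random.sample(population, 2_000))
print(small)
print(large)
```

The variance of individual prices stays about the same in both samples, because some games really are cheap and some really are expensive. What shrinks with the larger sample is the standard error, which is why the mean from 2,000 games is the more trustworthy estimate.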
What is statistical significance?
Assuming there’s no difference between A and B, how often will you see the effect just by chance?
The lower the statistical significance level, the bigger the chance that your “winner” is not a real winner at all (this is known as a false positive).
Be aware that most A/B testing tools and open source A/B testing software call statistical significance without waiting for a predetermined sample size or point in time to be reached. That’s why you might notice your test flipping back and forth between statistically significant and statistically insignificant.
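For a feel of what your tool is computing when it reports significance, here is a minimal sketch of a two-proportion z-test, one common way to compare two conversion rates. The visitor and conversion counts are invented, and real tools may use different methods, so treat this as an illustration rather than a replica of any product.

```python
from statistics import NormalDist

def two_proportion_p_value(conversions_a: int, visitors_a: int,
                           conversions_b: int, visitors_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pool both groups: under "no difference" they share one conversion rate.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    # Note: this divides by zero if nobody (or everybody) converted.
    std_error = (pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b)) ** 0.5
    z = (rate_b - rate_a) / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

p = two_proportion_p_value(500, 10_000, 575, 10_000)
print(f"p-value: {p:.3f}")
```

The p-value answers the question posed above: if A and B were truly identical, how often would a gap this large show up by chance? A value below 0.05 corresponds to the 95% threshold many tools use.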
Peep Laja, founder of CXL, wants more people to really understand A/B test statistical significance and why it’s important.
What is regression to the mean?
You might notice extreme fluctuations at the beginning of your A/B test.
Regression to the mean is the phenomenon that says if something is extreme on its first measurement, it will likely be closer to the average on its second measurement.
If the only reason you’re calling a test is because it’s reached statistical significance, you could be seeing a false positive. Your winning variation will likely regress to the mean over time.
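You can watch this happen in a simulation. The sketch below runs A/A tests, where both versions are identical, so every “winner” is a false positive by construction. It compares stopping the moment significance appears against waiting for the planned end. All the settings (200 simulated tests, 10 check-ins, 500 visitors per group between check-ins, a 5% conversion rate) are arbitrary assumptions chosen for illustration.

```python
import random
from statistics import NormalDist

def is_significant(conv_a: int, conv_b: int, n: int, alpha: float = 0.05) -> bool:
    """Two-sided z-test for two groups of equal size n."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0, 1):
        return False  # no variation yet, nothing to compare
    std_error = (pooled * (1 - pooled) * 2 / n) ** 0.5
    z = abs(conv_a - conv_b) / n / std_error
    return 2 * (1 - NormalDist().cdf(z)) < alpha

def simulate_aa_tests(runs: int = 200, checks: int = 10, visitors_per_check: int = 500,
                      rate: float = 0.05, seed: int = 1) -> tuple[float, float]:
    """Return (false positive rate when stopping at first significance,
    false positive rate when judging only at the planned end)."""
    rng = random.Random(seed)
    stopped_early = at_end = 0
    for _ in range(runs):
        conv_a = conv_b = n = 0
        ever_significant = False
        significant_now = False
        for _ in range(checks):
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_check))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_check))
            n += visitors_per_check
            significant_now = is_significant(conv_a, conv_b, n)
            ever_significant = ever_significant or significant_now
        stopped_early += ever_significant
        at_end += significant_now
    return stopped_early / runs, at_end / runs

peeking, fixed = simulate_aa_tests()
print(f"Stopping at first significance: {peeking:.0%} false positives")
print(f"Waiting for the planned sample: {fixed:.0%} false positives")
```

Because the final check-in is one of the moments you could have stopped at, peeking can never produce fewer false positives than waiting, and in practice it tends to produce noticeably more.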
What is statistical power?
Assuming there’s a difference between A and B, how often will you see the effect?
The lower the power level, the bigger the chance that a winner will go unrecognized. The higher the power level, the lower the chance that a winner will go unrecognized. Really, all you’ll need to know is that 80% statistical power is standard for most A/B testing tools.
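To show how power depends on traffic, here is a rough sketch using a normal approximation. It assumes a two-sided 5% significance level, and the conversion rates (5% for the control, 5.75% for the variant) and sample sizes are illustrative. The function name is hypothetical.

```python
from statistics import NormalDist

def statistical_power(baseline: float, variant: float, visitors_per_variation: int,
                      alpha: float = 0.05) -> float:
    """Approximate chance of detecting a true difference between two conversion rates."""
    delta = abs(variant - baseline)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    sd_null = (2 * baseline * (1 - baseline)) ** 0.5
    sd_alt = (baseline * (1 - baseline) + variant * (1 - variant)) ** 0.5
    # How far the expected difference sits beyond the significance threshold.
    z = (delta * visitors_per_variation ** 0.5 - z_alpha * sd_null) / sd_alt
    return NormalDist().cdf(z)

well_powered = statistical_power(0.05, 0.0575, 13_533)
underpowered = statistical_power(0.05, 0.0575, 2_000)
print(f"13,533 visitors per variation: {well_powered:.0%} power")
print(f"2,000 visitors per variation: {underpowered:.0%} power")
```

With about 13,500 visitors per variation, the real lift is detected roughly four times out of five. With only 2,000, the same real lift goes unrecognized most of the time.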
Ton Wesseling, founder of Online Dialogue, wishes more people knew about statistical power.
What are external validity threats?
There are external factors that threaten the validity of your tests. For example:
- Black Friday Cyber Monday (BFCM) sales
- A positive or negative press mention
- A major paid campaign launch
- The day of the week
- The changing seasons
One of the more common A/B testing examples where external validity threats impact your results is during seasonal events. Say you were to run a test during December. Major shopping holidays would mean an increase in traffic for your store during that month. You might find in January that your December winner is no longer performing well.
Why? Because of an external validity threat: the holidays. The data you based your test decision on was an anomaly, so when things settle down in January, your winner may start losing.
You can’t eliminate external validity threats, but you can mitigate them by running tests for full weeks (e.g., don’t start a test on a Monday and end it on a Friday), including different types of traffic (e.g., don’t test paid traffic exclusively and then roll out the results to every traffic source), and being mindful of potential threats.
How to set up an A/B test
Let’s walk through a little A/B testing tutorial. Before you test anything, you need to have a solid hypothesis. (Great, we just finished math class and now we’re on to science.) For example, “If I lower what I charge for shipping, conversion rates will increase.”
Don’t worry, it’s not complicated. Basically, you need to test a hypothesis, not an idea. A hypothesis is measurable, aspires to solve a specific conversion problem, and focuses on insights instead of wins.
Whenever I’m writing a hypothesis, I use a formula borrowed from Craig Sullivan’s Hypothesis Kit:
- Because you see [insert data/feedback from research]
- You expect that [change you’re testing] will cause [impact you anticipate] and
- You’ll measure this using [data metric]
Easy, right? All you have to do is fill in the blanks and your test idea has transformed into a hypothesis.
Choosing an A/B testing tool
The three tools below are all good, safe options.
- Google Optimize. Free, save for some multivariate limitations, which shouldn’t really impact you if you’re just getting started. It works well when performing Google Analytics A/B testing, which is a plus.
- Optimizely. Easy to get minor tests up and running, even without technical skills. Stats Engine makes it easier to analyze test results. Typically, Optimizely is the most expensive option of the three.
- VWO. VWO has SmartStats to make analysis easier. Plus, it has a great WYSIWYG editor for beginners. Every VWO plan comes with heatmaps, on-site surveys, form analytics, etc.
We also have some A/B testing tools in the Shopify App Store that you might find helpful.
Once you’ve selected an A/B testing tool or split-testing software, fill out the sign-up form and follow the instructions provided. The process varies from tool to tool. Typically, though, you’ll be asked to install a snippet on your site and set goals.
How to analyze A/B test results
Remember when I said writing a hypothesis shifts the focus from wins to insights? Krista Seiden, Analytics Advocate and former Product Manager at Google, explains what that means:
If you craft your hypothesis correctly, even a loser is a winner, because you’ll gain insights you can use for future tests and in other areas of your business. So, when you’re analyzing your test results, you need to focus on the insights, not whether the test won or lost. There’s always something to learn, always something to analyze. Don’t dismiss the losers!
The most important thing to note here is the need for segmentation. A test might be a loser overall, but chances are it performed well with at least one segment. What do I mean by segment?
- New visitors
- Returning visitors
- iOS visitors
- Android visitors
- Chrome visitors
- Safari visitors
- Desktop visitors
- Tablet visitors
- Organic search visitors
- Paid visitors
- Social media visitors
- Logged-in buyers
You get the idea, right?
When you’re looking at the results in your testing tool, you’re looking at the whole box of candies. What you need to do is sort the candies so you can uncover deeper, segmented insights.
Odds are that the hypothesis was proven right among certain segments. That tells you something as well.
Analysis is about so much more than whether the test was a winner or a loser. Segment your data to find hidden insights below the surface.
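Here is a minimal sketch of what segmenting can look like once you export results from your testing or analytics tool. The numbers are invented to show a test that loses overall but wins with returning visitors, and the tuple layout is an assumption about how your export might be shaped.

```python
from collections import defaultdict

# Hypothetical export: (segment, version, visitors, conversions)
results = [
    ("new visitors", "control", 4000, 160),
    ("new visitors", "variant", 4000, 130),
    ("returning visitors", "control", 1000, 60),
    ("returning visitors", "variant", 1000, 80),
]

def conversion_rates_by_segment(rows) -> dict:
    """Map each segment to the conversion rate of every version it saw."""
    rates = defaultdict(dict)
    for segment, version, visitors, conversions in rows:
        rates[segment][version] = conversions / visitors
    return dict(rates)

rates = conversion_rates_by_segment(results)
for segment, by_version in rates.items():
    lift = by_version["variant"] / by_version["control"] - 1
    print(f"{segment}: control {by_version['control']:.1%}, "
          f"variant {by_version['variant']:.1%}, lift {lift:+.1%}")
```

Keep in mind that each segment is a smaller sample than the whole test, so a difference inside one segment needs its own significance check before you act on it.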
A/B testing tools won’t do the analysis for you, so this is an important skill to develop over time.
How to archive past A/B tests
Let’s say you run your first test tomorrow. Two years from tomorrow, will you remember the details of that test? Not likely.
That’s why archiving your A/B testing results is important. Without a well-maintained archive, all those insights you’re gaining will be lost. Plus, I kid you not, it’s very easy to test the same thing twice if you’re not archiving.
There’s no “right” way to do this, though. You could use a tool like Effective Experiments, or you could use Excel. It’s really up to you, especially when you’re just getting started. Just make sure you’re keeping track of:
- The hypothesis
- Screenshots of the control and variation
- Whether it won or lost
- Insights gained through analysis
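If you go the spreadsheet route, the fields above map neatly onto one row per test. The sketch below is just one possible layout: the `ArchivedTest` record, its field names, the file names, and the example entry are all hypothetical. It produces CSV text you could save and open in Excel.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields

@dataclass
class ArchivedTest:
    name: str
    hypothesis: str
    control_screenshot: str    # file name or link to the screenshot
    variation_screenshot: str
    outcome: str               # "won" or "lost"
    insights: str

def to_csv(records: list[ArchivedTest]) -> str:
    """Render archived tests as CSV text, one row per test."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(ArchivedTest)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()

archive = [
    ArchivedTest(
        name="Shipping price test",
        hypothesis="If I lower what I charge for shipping, conversion rates will increase.",
        control_screenshot="shipping-control.png",
        variation_screenshot="shipping-variation.png",
        outcome="lost",
        insights="Returning visitors responded well; new visitors did not.",
    ),
]

csv_text = to_csv(archive)
print(csv_text)
```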
As you grow, you’ll thank yourself for keeping this archive. It will help not only you but also new hires, advisers, and stakeholders.
A/B testing processes of the pros
Now that you’ve been through a standard A/B testing tutorial, let’s take a look at the exact processes of pros from companies like Google and HubSpot.
My step-by-step process for web and app A/B testing starts with analysis—in my opinion, this is the core of any good testing program. In the analysis stage, the goal is to examine your analytics data, survey or UX data, or any other sources of customer insight you might have in order to understand where your opportunities for optimization are.
Once you have a good pipeline of ideas from the analysis stage, you can move on to hypothesize what might be going wrong and how you could potentially fix or improve these areas of optimization.
Next, it’s time to build and run your tests. Be sure to run them for a reasonable amount of time (I default to two weeks to ensure I’m accounting for week over week changes or anomalies), and when you have enough data, analyze your results to determine your winner.
It’s also important to take some time in this stage to analyze the losers as well—what can you learn from these variations?
Finally, and you may only reach this stage once you’ve spent time laying the groundwork for a solid optimization program, it’s time to look into personalization. This doesn’t necessarily require a fancy toolset but rather can come out of the data you have about your users.
Marketing personalization can be as easy as targeting the right content to the right locations or as complex as targeting based on individual user actions. Don’t jump in all at once on the personalization bit though. Be sure you spend enough time to get the basics right first.
Alex Birkett, Omniscient Digital
At a high level, I try to follow this process:
- Collect data and make sure analytics implementations are accurate.
- Analyze data and find insights.
- Turn insights into hypotheses.
- Prioritize based on impact and ease, and maximize allocation of resources (especially technical resources).
- Run a test (following statistics best practices to the best of my knowledge and ability).
- Analyze results and implement or not according to the results.
- Iterate based on findings, and repeat.
Put more simply: research, test, analyze, repeat.
While this process can deviate or change based on what the context is (Am I testing a business-critical product feature? A blog post CTA? What’s the risk profile and balance of innovation vs. risk mitigation?), it’s pretty applicable to any size or type of company.
The point is this process is agile, but it also collects enough data, both qualitative customer feedback and quantitative analytics, to be able to come up with better test ideas and better prioritize them so you can drive traffic to your online store.
Ton Wesseling, Online Dialogue
The first question we always answer when we want to optimize a customer journey is: Where does this product or service fit on the ROAR model we created at Online Dialogue? Are you still in the risk phase, where we could do lots of research but can’t validate our findings through A/B test online experiments (below 1,000 conversions per month), or are you in the optimization phase? Or even above?
- Risk phase: lots of research, which will be translated into anything from a business model pivot to a whole new design and value proposition.
- Optimization phase: large experiments that will optimize the value proposition and the business model.
- Optimization phase: small experiments to validate user behavior hypotheses, which will build up knowledge for larger design changes.
- Automation: you still have experimentation power (visitors) left, meaning your full test potential is not needed to validate your user journey. What’s left should be used to exploit, to grow faster now (without focus on long-term learnings). This could be automated by running bandits/using algorithms.
- Re-think: you stop adding lots of research, unless it’s a pivot to something new.
So web or app A/B testing is only a big thing in the optimization phase of ROAR and beyond (until re-think).
Our approach to running experiments is the FACT & ACT model.
The research we do is based on our 5V Model.
We gather all these insights to come up with a main research-backed hypothesis, which will lead to sub-hypotheses that will be prioritized based on the data gathered through either desktop or mobile A/B testing. The higher the chance of the hypothesis being true, the higher it will be ranked.
Once we learn if our hypothesis is true or false, we can start combining learnings and take bigger steps by redesigning/realigning larger parts of the customer journey. However, at some point, all winning implementations will lead to a local maximum. Then you need to take a bigger step to be able to reach a potential global maximum.
And, of course, the main learnings will be spread throughout the company, which leads to all sorts of broader optimization and innovation based on your validated first-party insights.
Julia Starostenko, Pinterest
The purpose of an experiment is to validate that making changes to an existing webpage will have a positive impact on the business.
Before getting started, it’s important to determine if running an experiment is truly necessary. Consider the following scenario: there is a button with an extremely low click rate. It would be near impossible to decrease the performance of this button. Validating the effectiveness of a proposed change to the button (i.e., running an experiment) is therefore not necessary.
Similarly, if the proposed change to the button is small, it probably isn’t worth spending the time setting up, executing, and tearing down an experiment. In this case, the changes should just be rolled out to everyone and performance of the button can be monitored.
If it is determined that running an experiment would in fact be beneficial, the next step is to define the business metrics that should be improved (e.g., increase the conversion rate of a button). Then we ensure that proper data collection is in place.
Once this is complete, the audience is randomly split between two groups: one group is shown the existing version of the button while the other group gets the new version. The conversion rate of each audience is monitored, and once statistical significance is reached, the results of the experiment are determined.
Peep Laja, CXL
A/B testing is a part of a bigger conversion optimization picture. In my opinion it’s 80% about the research and only 20% about testing. Conversion research will help you determine what to test to begin with.
My process typically looks like this (a simplified summary):
- Conduct conversion research using a framework like ResearchXL to identify issues on your site.
- Pick a high priority issue (one that affects a large portion of users and is a severe issue), and brainstorm as many solutions to this problem as you can. Inform your ideation process with your conversion research insights. Determine which device you want to run the test on (you need to run mobile A/B testing separate from desktop).
- Determine how many variations you can test (based on your traffic/transaction level), and then pick your best one to two ideas for a solution to test against control.
- Wireframe the exact treatments (write the copy, make the design changes, etc.). Depending on the scope of changes, you might also need to include a designer to design new elements.
- Have your front-end developer implement the treatments in your testing tool. Set up necessary integrations (Google Analytics) and set appropriate goals.
- Conduct QA on the test (broken tests are by far the biggest A/B testing killer) to make sure it works with every browser/device combo.
- Launch the test!
- Once the test is done, conduct post-test analysis.
- Depending on the outcome either implement the winner, iterate on the treatments, or go and test something else.
Optimize A/B testing for your business
You have the process, you have the power! So, get out there, get the best A/B testing software, and start testing your store. Before you know it, those insights will add up to more money in the Bank of You.
If you want to continue learning about optimization, consider taking a free course, such as Udacity’s A/B testing by Google. You can learn more about web and mobile app A/B testing to boost your optimization skill set.
A/B testing FAQ
What is A/B testing?
At the most basic level, A/B testing is testing two versions of something to see which performs better. You can A/B test a variety of things related to your business, including social media posts, content, email, and product pages.
What’s an example of A/B testing?
An example of A/B testing would be running paid traffic to two slightly different product pages to see which page has the highest conversion rate.
To ensure your A/B tests can provide valuable insight, it’s recommended that you have traffic of more than 5,000 visitors to a given page.