Many teams believe they know their product and users well, yet making decisions based on intuition or “expert opinions” does not protect them from mistakes. Relying on facts significantly reduces this risk.
Product development resembles a scientific study: you formulate a hypothesis, run an experiment and draw conclusions from the data. In digital products, the A/B test is the key way to validate hypotheses.
This comprehensive guide covers the essentials of A/B tests: why they are necessary, how to run them step by step, and common pitfalls to avoid along the way.
Contents
- Definition of an A/B test
- Why A/B tests matter in product work
- Other types of tests
- How to run A/B tests: a step‑by‑step guide
- Examples of product improvements through A/B testing
- Experiments where you make your product worse—a fast way to validate hypotheses
- The peeking problem in A/B tests
- Mistakes and pitfalls in A/B testing
Definition of an A/B test
In product management an A/B test is a type of experiment that lets you measure the impact of a change by comparing two versions of a product.
Users are randomly divided into two groups: the control group (A) sees the current version, while the test group (B) sees the modified version. After running the experiment, you analyze predefined key metrics to understand how the change affected user behavior.
How A/B testing evolved in tech
The roots of A/B testing lie in scientific and medical practice. The first known controlled experiment was conducted in 1747 by the Scottish surgeon James Lind, who divided sick sailors into groups and prescribed different treatments. Ultimately, only the group receiving citrus fruit improved. But widespread use of randomized controlled trials began in the 20th century, when the British statistician Ronald Fisher articulated their core principles. In 1948 the first trial of this kind measured the effectiveness of streptomycin against tuberculosis.
Over time, controlled experiments moved beyond science. By the 1950s marketers were using them. With the rise of the internet the method found its way into digital products: first in e‑commerce and online services, then across the broader tech industry. In the early 2000s Google and Microsoft started using A/B testing at scale to improve user experience and revenue.
Why A/B tests matter in product work
You can make product decisions without A/B tests—relying instead on intuition, expert judgement, market trends, competitor successes or stakeholder opinions. Sometimes this works, but even then you may not know why it worked.
A/B tests solve that problem. They allow you to isolate a change and understand whether it actually caused the observed effect. Instead of altering the entire product, you can experiment on a small part of it and attribute any impact to that specific change. In short, A/B testing helps you act on facts and replace guesswork with evidence‑based decisions.
What A/B tests give to the product team and the business
- Reduce risk and error: Validate hypotheses on a small sample before a full rollout, minimizing the likelihood of harming the business.
- Optimise the user experience: Identify which changes make the product more convenient (e.g., design tweaks, buttons, forms, content).
- Find working solutions to grow metrics: Increase conversion, average order value, retention and other key indicators (KPIs) based on confirmed data.
- Improve the product iteratively: The cumulative effect of continuous, small improvements produces a significant positive impact over time.
- Save time and resources: Instead of investing heavily in a feature that may not “take off,” test the idea first. This is especially valuable when budgets are tight and teams are overloaded.
- Build product culture: Teams that run experiments regularly learn to think in hypotheses, justify decisions, and not fear mistakes—because they have a safe framework to test ideas.
Companies like Google, Amazon, Netflix and Booking.com run thousands of A/B experiments a year, testing even tiny changes. For them experiments are an integral part of product culture.
Other types of tests
A/B testing is the most common method, but there are other ways to conduct experiments:
- A/A test: Both groups receive identical versions of the product to verify that the testing infrastructure works correctly. If significant differences appear between two identical variants, it may indicate data tracking issues, problems in group assignment (often called Sample Ratio Mismatch), or other errors. Running an A/A test helps surface and fix such issues before you start the real A/B testing.
- A/A/B test: A variant of the A/B test with two control groups (A1 and A2) and one test group (B). This lets you check that the testing setup is consistent: if the two control groups show notable differences during the experiment, you know something is broken.
- Switchback experiments: The product versions alternate over time windows for an entire population or location. For example, a given city sees version A of the product on one day and version B the next. This approach is used when users can influence one another (as in social products or network‑effect platforms) and helps minimize cross‑group interference.
- Multivariate testing (MVT): Multiple elements are changed simultaneously (e.g. the button colour and the text) to find the best combination. MVT requires a large user base because the number of possible combinations multiplies with each added element, meaning each variant receives a smaller slice of traffic.
In this guide we focus on A/B tests. Once you understand the classic A/B test, learning other experiment types becomes much easier.
How to run A/B tests: a step‑by‑step guide
Like scientific experiments, A/B tests require clear hypotheses, thoughtful design, careful execution and proper interpretation of results. Before we dive into the steps, let’s define some key concepts used in the process.
Control and experimental groups
In A/B testing users are divided into two groups:
- Control group (A): receives the current version of the product or feature.
- Test group (B): receives the modified version that needs evaluation.
Comparing results across the groups shows the impact of the change.
Important points:
- Participants in each group should not overlap. They must belong to the same segment and be solving the same problem in the product to ensure clean, unbiased results.
- Users should be randomly assigned to the control and experimental groups.
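In practice, random assignment is usually implemented as deterministic hashing rather than a live coin flip, so the same user always lands in the same group across sessions and devices. A minimal sketch (function and experiment names are illustrative):

```python
import hashlib

def assign_group(user_id: str, experiment: str, split: float = 0.5) -> str:
    """Deterministically bucket a user into group 'A' or 'B'.

    Hashing the user ID together with the experiment name keeps the
    assignment stable for each user and independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    # Map the first 8 hex digits to a roughly uniform value in [0, 1].
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return "A" if bucket < split else "B"

# The same user always gets the same variant:
assert assign_group("user42", "checkout_test") == assign_group("user42", "checkout_test")
```

Because the hash is uniform, large populations split close to the configured ratio without any shared state between servers.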
Defining success metrics
Before starting the test you must specify which indicators will be measured to gauge the effectiveness of the change. Common examples include:
- Conversion: Percentage of users who complete a target action (e.g., a purchase).
- Engagement: Time spent on the site, number of pages viewed.
- Retention: Share of users who return to the product after a given period.
These are just examples. Your success metric should tie back to your specific hypothesis.
P‑value and statistical significance
The p‑value is a statistical indicator that helps you decide whether differences between groups are random or meaningful. A low p‑value (typically below 0.05) suggests that the differences are statistically significant and unlikely to be due to random chance.
Step 1: Define the subject and goals of the test
You can analyze almost any metric, as long as it is relevant to the task. For recommendation tests you might look at click‑through rates on items; in a mobile banking experiment you might measure the time to complete a target action. In media products you might track the percentage of articles read to the end, and so on.
Important: In A/B tests, you should change only one parameter at a time so you can pinpoint what affected the result. The user experience for both groups must remain identical in all other respects.
Before launching an A/B test, formulate the hypothesis clearly using the pattern:
If we change [specific element], it will improve [metric] by X %.
For example:
- Too vague: “Update the payment page design.”
- Concrete and measurable: “If we change the ‘Buy’ button colour from grey to orange, conversion will increase by 5 %.”
What metrics can you influence through an A/B test?
- Conversion (e.g., percentage of users who place an order).
- Clicks (on a button, link or ad banner).
- Engagement (time on site, number of pages viewed).
- Retention (how many people return after N days).
These illustrate the kind of metrics linked to the hypothesis.
Step 2: Choose a tool
Running an A/B test requires specialized tools. Your choice depends on factors such as the team’s technical capabilities, platform (mobile vs. web), budget, company size, testing goals, and integration with existing systems. Below is a selection of common solutions (note that this list is not exhaustive).
For web pages
| Name | Website | Comment |
| --- | --- | --- |
| Optimizely | optimizely.com | One of the most powerful tools for client‑side and server‑side A/B tests. |
| VWO | vwo.com | Offers a visual editor, analytics, heat maps and more in one interface. |
| Zoho PageSense | zoho.com/pagesense/ | An affordable solution for small and medium‑sized businesses with wide functionality. |
| Crazy Egg | crazyegg.com | Suitable for small businesses; includes heat maps and simple tests. |
| Unbounce | unbounce.com | Code‑free landing‑page creation and testing—great for marketers. |
| Convertize | convertize.com | Intuitive visual editor; convenient for marketing teams. |
| Adobe Target | adobe.com/target | Enterprise‑level solution with advanced personalisation and AI optimisation. |
| Kameleoon | kameleoon.com | Advanced A/B and multivariate testing for large projects. |
| SiteSpect | sitespect.com | Testing without JavaScript—suited for high‑traffic sites; server‑side access. |
For mobile applications
| Name | Website | Comment |
| --- | --- | --- |
| Firebase A/B Testing | firebase.google.com | Free tool from Google, easily integrates with other Firebase services. |
| Airship | airship.com | Supports both visual and server‑side tests, including feature flags. |
| UXCam | uxcam.com | Provides visual analytics, heat maps, session recordings and A/B experiments. |
| Harness | harness.io | Server‑side testing focused on engineering teams. |
| LaunchDarkly | launchdarkly.com | Feature‑flag‑based experiments, suited for large mobile products. |
| CleverTap | clevertap.com | Combines A/B tests with marketing communications, including push notifications. |
| Dynatrace | dynatrace.com | Full growth platform: A/B tests, analytics, feature flags. |
| Mixpanel | mixpanel.com | Primarily behavioral analytics, but includes basic experimentation tools. |
| Kameleoon Mobile | kameleoon.com | Native‑app solution fully integrated with the main Kameleoon platform. |
Many of these tools offer free trials so you can evaluate their features before purchasing. Be sure to check each service’s terms—some require a subscription to access results. If you join an established product team, existing tools are likely already in place, so selecting a tool is more relevant for new products or startups.
Step 3: Design the experiment
Once you have defined the hypothesis and metrics, move on to the technical design.
- Define the test parameters
- Minimum Detectable Effect (MDE): The smallest difference between the groups that would matter in practice; the test is sized so it can detect it.
- Significance level (α) and Test power (1 – β): Determine your risk tolerance. The significance level defines how much error you can tolerate if you mistake random noise for a real effect (a false positive). Test power dictates how confident you want to be in successfully detecting a real effect when one actually exists (avoiding a false negative).
- Choose a statistical test: Select the appropriate mathematical method for your data (for example, using a t-test to compare means).
- Calculate sample size: Determine how many users are needed to detect a difference if one exists. A highly recommended industry tool for this is Evan Miller’s Sample Size Calculator.
- Set up randomization: Ensure users are randomly assigned to control and test groups, typically 50/50, to avoid bias.
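The sample‑size step above can be sketched with the standard two‑proportion formula; this mirrors what tools like Evan Miller's calculator compute (the function name and default values are illustrative):

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_base: float, mde: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per group to detect an absolute lift
    `mde` over a baseline conversion `p_base` (two-sided test)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # quantile for the significance level
    z_beta = z.inv_cdf(power)            # quantile for the desired power
    p_new = p_base + mde
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    n = ((z_alpha + z_beta) ** 2 * variance) / mde ** 2
    return math.ceil(n)

# Detecting a lift from 5% to 6% conversion at alpha=0.05, power=0.8:
print(sample_size_per_group(0.05, 0.01))  # roughly eight thousand users per group
```

Note how quickly the requirement grows as the MDE shrinks: halving the detectable effect roughly quadruples the sample size.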
Step 4: Run the experiment – important process aspects
Commit to a fixed testing period: Decide how long the experiment will last based on the sample‑size calculation before you launch. Avoid changing the test duration mid‑experiment to prevent bias.
How long should you run the test?
Test duration depends on several factors:
- Sample size requirements: To get a reliable result, you must collect enough data. How much depends on the difference you want to detect and the desired accuracy.
- Seasonality and activity cycles: The test period should cover a full cycle of user activity (e.g., a week) to smooth out daily and weekly fluctuations (e.g., weekend vs. weekday traffic).
- Platform recommendations: As a general baseline, platforms like Facebook suggest running tests for at least seven days but no longer than 30 days to gather sufficient information.
Beware the “peeking problem”: Peeking means prematurely analyzing interim results before the test period is finished. Doing this inflates the risk of a Type I error (false positive) and leads to wrong decisions. We will discuss the peeking problem in detail later.
Monitor technical metrics: Ensure that both product variants work correctly so that technical issues do not skew results. Confirm that users are evenly split between control and test groups.
Step 5: Analyse the results
Check statistical significance
Not every observed difference between variants is meaningful. Sometimes differences arise by chance rather than true effect. You must evaluate statistical significance to ensure the effect is linked to the change.
Many online tools make it easy to assess the statistical significance of A/B tests, such as SurveyMonkey’s A/B testing calculator for statistical significance. Many A/B testing platforms also provide built-in significance calculators.
For those who want to understand the math behind the process:
If you want to dive deeper into the statistical methods used in analysing A/B tests, it is crucial to understand the p‑value.
A p‑value is a statistic that helps determine how well the experimental results align with the null hypothesis. The null hypothesis assumes there is no effect or difference between the groups being compared. The p‑value shows the likelihood of obtaining the observed results, or more extreme ones, if the null hypothesis is true.
The formula for calculating a p‑value depends on the chosen statistical test and the data distribution. In general, it is computed as the area under the probability density curve of the test statistic, starting from the observed value and continuing into the more extreme tail. The larger this area, the higher the p‑value and the less significant the test results.
A common threshold is p = 0.05. This means that if the p‑value is below 0.05, the differences are considered statistically significant.
Suppose we test a new version of a web page and want to know if it affects conversion compared with the current version. The null hypothesis states that there is no difference. After the test, we get a p‑value of 0.03, meaning there is a 3 % chance of seeing these or more extreme results assuming there is no real effect.
Key points to remember:
- The p‑value does not indicate the size of the effect; it only tells you how unlikely the observed data would be if there were no real effect.
- A low p‑value suggests the results are unlikely to have occurred by chance, but it doesn’t guarantee that the effect is practically meaningful for the business.
- A high p‑value indicates that the observed differences could have occurred randomly, but it does not prove that there is no effect (it simply fails to prove there is one).
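As an illustration of the math above, a pooled two‑proportion z‑test computes the p‑value from nothing more than the conversion counts. This is a sketch for intuition; in practice, use your testing platform's built‑in analysis or a vetted statistics library:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a: int, n_a: int,
                           conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion
    rates, using the pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)       # rate under the null hypothesis
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# 5% conversion in control vs 6% in test, 10,000 users each:
print(two_proportion_p_value(500, 10_000, 600, 10_000))  # well below 0.05
```

With these illustrative numbers the difference is statistically significant; with far fewer users the same 1‑point lift would not be.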
Account for external factors
Consider outside influences such as seasonality, marketing campaigns, or news events. When the test uses two equal cohorts under identical conditions, external factors should impact both groups equally. However, you must still be aware of parallel product or marketing changes during the experiment, as they could affect user behavior and distort the result.
Confidence intervals
Confidence intervals (CI) indicate the range of plausible values for the true metric, typically calculated at the 95 % level. For example, if an A/B test yields a conversion rate of 5 % ± 1 % (95 % CI: 4–6 %), the data are consistent with a true conversion rate anywhere in that range; across many repeated experiments, 95 % of such intervals would contain the true value.
Note that if the confidence interval for the difference between variants does not include zero, the effect is statistically significant.
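A sketch of the corresponding calculation for the difference between variants, using a simple Wald interval (function name and numbers are illustrative):

```python
from math import sqrt
from statistics import NormalDist

def diff_confidence_interval(conv_a: int, n_a: int,
                             conv_b: int, n_b: int,
                             level: float = 0.95) -> tuple[float, float]:
    """Wald confidence interval for the difference in conversion
    rates (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    z = NormalDist().inv_cdf(0.5 + level / 2)
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# 5% vs 6% conversion with 10,000 users per group:
low, high = diff_confidence_interval(500, 10_000, 600, 10_000)
print(f"95% CI for the lift: [{low:.4f}, {high:.4f}]")  # does not include zero
```

Because the interval excludes zero, this (illustrative) result would count as statistically significant, matching the rule stated above.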
Dealing with noise and false positives
Noise (random fluctuations or errors unrelated to the change) and false positives can distort the interpretation of a test. To mitigate these:
- Correct for multiple comparisons: Running many tests or tracking many metrics increases the risk of false positives. Statistical adjustments like the Bonferroni correction help control this risk.
- Run A/A tests: Testing two identical variants helps gauge baseline noise and false‑positive rates.
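The Bonferroni correction mentioned above is simple enough to show directly: with m comparisons, each p‑value is tested against α/m instead of α (illustrative sketch):

```python
def bonferroni_significant(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Flag which of several p-values remain significant after the
    Bonferroni correction: each is compared against alpha / m."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Three metrics tracked in one experiment; only the first survives the correction:
print(bonferroni_significant([0.01, 0.04, 0.20]))
```

Note that 0.04 would pass an uncorrected 0.05 threshold but fails the corrected one (0.05 / 3 ≈ 0.017), which is exactly the false positive the correction guards against.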
Step 6: Make decisions
The results of A/B tests provide valuable input for decisions, but they are not orders to act. Even when a statistically significant positive effect is found, consider the context and other factors before rolling out the change.
Possible post-test actions include:
- Analyze and adjust the hypothesis: If results don’t meet expectations, revisit assumptions and explore alternative solutions.
- Decline to roll out: If the potential benefit does not justify the costs or risks, it may be wise not to implement the change.
- Execute a limited rollout: Even after achieving positive results, it can be useful to introduce the change to a small share of users first to see whether it holds up in real‑world conditions.
- Analyze side effects: A change could improve the main metric but hurt others. Before rollout, check that secondary metrics and user experience are not adversely affected.
Examples of product improvements through A/B testing
Case 1: Optimising Netflix cover art
Problem: Netflix found that users often decide what to watch based on cover art. Standard covers did not always attract attention or motivate viewing.
Hypothesis: Personalizing covers based on user preferences would increase engagement and total time spent in the product by 5%.
Testing: Netflix built a system that automatically selected the cover most aligned with each user’s interests—for example, showing comedy‑themed covers to users who frequently watched comedies.
Result: The personalized covers significantly increased clicks and viewing time.
Case 2: Improving Amazon recommendations
Problem: Amazon’s recommendation system sometimes failed to suggest relevant products, reducing conversion and customer satisfaction.
Hypothesis: Implementing a new machine‑learning model for personalised recommendations would improve accuracy and boost sales.
Testing: Amazon ran an A/B test comparing the current recommendation system with the new model. Users were randomly split: the control group received the old recommendations, while the test group received the new ones.
Result: The new model increased click‑through on recommendations and conversion. Amazon rolled out the updated personalisation system to all users.
Case 3: Optimising Spotify’s landing page
Problem: Spotify aimed to increase the number of premium subscriptions by providing highly relevant content for users coming from specific search queries.
Hypothesis: Creating customized landing pages tailored to users’ interests would increase conversion to premium.
Testing: Spotify ran an experiment targeting users in Germany who searched for “audiobooks” and clicked on an ad. Half were directed to the standard generic page, and the other half to a specially designed page focused on audiobooks.
Result: The customized page increased premium subscriptions by 24%.
Experiments where you make your product worse—a fast way to validate hypotheses
Experiments where you make your product worse are an effective but seldom‑used tool. Product teams sometimes consider a factor important without objective evidence. For example:
- “We need to speed up the app, otherwise users will leave.”
- “If we stop sending push notifications, retention will drop.”
- “Fast support response is key to user satisfaction.”
Instead of investing resources in improving something (which may be costly), a degradation test lets you check how much that factor actually affects key metrics.
Principle: Intentionally make an aspect of the product worse and measure its impact on user behavior.
Scenario: A company wants to invest in accelerating its app because it assumes faster load times will increase retention and conversion.
A/B test: Instead of immediately spending money on optimization, the team slows down the app (for example by 1, 2 and 3 seconds) and observes how key metrics change.
Possible result: The key metrics do not change, meaning that investing resources in speeding up the app is not a priority at this stage.
Businesses worry that making the product worse will drive users away or provoke negative reviews. But if a degradation test is designed properly, it affects only a small share of the audience and runs for a limited time.
The peeking problem in A/B testing
What is the peeking problem?
One of the most common and dangerous mistakes in A/B testing is peeking: reviewing interim results too soon. Teams may decide to stop a test when they see a significant difference, without waiting for the required sample size to be reached.
If you stop the test at the first sign of significance, you risk making a false conclusion.
Why does peeking distort results?
The phenomenon of random fluctuations
If a test shows a significant difference very quickly, it is likely to be a fluke. Even when there is no real difference between variants A and B, metrics will still oscillate by chance. Over the course of a long test, the difference between the groups will occasionally cross the threshold of statistical significance even if there is no true effect.
Thus, if you check the results too frequently, standard statistical methods start producing false positives.
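The inflation is easy to demonstrate with a small simulation: both variants are identical, so every "significant" result is by definition a false positive. Checking ten times during the test flags far more false winners than a single pre‑planned check (an illustrative sketch; exact rates vary with the seed and settings):

```python
import random
from math import sqrt
from statistics import NormalDist

def p_value(success_a: int, success_b: int, n: int) -> float:
    """Pooled two-proportion z-test, two-sided, equal group sizes."""
    p_pool = (success_a + success_b) / (2 * n)
    if p_pool in (0.0, 1.0):
        return 1.0
    se = sqrt(p_pool * (1 - p_pool) * 2 / n)
    z = (success_b / n - success_a / n) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(7)
TRUE_RATE = 0.05          # both variants identical: any "win" is a false positive
SIMS, N, LOOKS = 400, 2000, 10

peek_fp = final_fp = 0
for _ in range(SIMS):
    a = [random.random() < TRUE_RATE for _ in range(N)]
    b = [random.random() < TRUE_RATE for _ in range(N)]
    # Peeking: check after every N // LOOKS users, stop at the first "significant" look.
    if any(p_value(sum(a[:k]), sum(b[:k]), k) < 0.05
           for k in range(N // LOOKS, N + 1, N // LOOKS)):
        peek_fp += 1
    # Discipline: a single pre-planned check at the end.
    if p_value(sum(a), sum(b), N) < 0.05:
        final_fp += 1

print(f"false positives with peeking:        {peek_fp / SIMS:.1%}")
print(f"false positives, single final check: {final_fp / SIMS:.1%}")
```

The single final check stays near the nominal 5 % error rate, while repeated peeking typically pushes it several times higher.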
How can you avoid this error?
- Fix the sample size in advance: Before starting, calculate how many users you need to detect a meaningful difference. For example, if you need 100,000 users, don’t analyze the test until that amount of data is collected.
- Use sequential testing: This method allows you to adapt the test dynamically without inflating the p‑value. Sequential testing lets you analyze data as it comes in and decide to finish the experiment earlier than planned if the results become statistically significant. Google and Optimizely use Sequential Experiment Design, which adjusts statistical calculations to account for frequent checks.
- Use a Bayesian approach.
In statistics there are two main ways to evaluate probabilities: frequentist and Bayesian.
- The frequentist method estimates probability based on the frequency of events across many repetitions. For example, if a coin lands heads 50 times out of 100 throws, you might say the probability of heads is 50%.
- The Bayesian method treats probability as a degree of belief that is updated as new information arrives. Suppose you know most of the candies in a box are chocolate, so if you pick one at random you expect it to be chocolate. If you later learn that caramel candies are also in the box, you update your confidence in picking a chocolate.
- In A/B testing, Bayesian methods allow you to factor in uncertainty and update the probability of a variant’s success in real time as new data arrives. Unlike the frequentist approach, Bayesian A/B tests explicitly model uncertainty and adjust the estimated probability of a winning variant as more data becomes available. However, Bayesian methods are also susceptible to peeking if not used properly.
- Don’t make a decision on the first sign of significance.
If the difference becomes significant, keep the test running until the planned end. If the difference remains stable for several days, you can then draw more confident conclusions.
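A minimal sketch of the Bayesian approach for conversion data: with Beta priors, the posterior probability that variant B beats variant A can be estimated by sampling from the two posteriors (priors, function name, and numbers are illustrative):

```python
import random

def prob_b_beats_a(conv_a: int, n_a: int, conv_b: int, n_b: int,
                   draws: int = 20_000, seed: int = 0) -> float:
    """Monte-Carlo estimate of P(rate_B > rate_A) under independent
    uniform Beta(1, 1) priors (the Beta-Binomial model)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each variant: Beta(1 + successes, 1 + failures).
        sample_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        sample_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += sample_b > sample_a
    return wins / draws

# 5% conversion in A vs 6% in B, 10,000 users each:
print(prob_b_beats_a(500, 10_000, 600, 10_000))  # close to 1: B is very likely better
```

The output reads naturally as "the probability that B is better", which many teams find easier to communicate than a p‑value, though (as noted above) it does not by itself make peeking safe.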
Mistakes and pitfalls in A/B testing
Prematurely ending the test
The team sees a promising difference and rushes to stop the experiment before enough data is collected.
Risk: High chance of a false positive. What looks like success on day 3 may look very different on day 10.
What to do: Define the sample size and test duration in advance and avoid acting before the test ends.
Lack of a clear hypothesis
The test is launched without a clear understanding of what is being tested and what outcome counts as success.
What to do: Write the hypothesis down before launch using the pattern from Step 1 (“If we change [element], [metric] will improve by X %”) and agree on the success metric in advance.
Ignoring the novelty effect
A short‑term spike in interest in a new element is mistaken for sustained growth. After the test ends, the metric “reverts” and the effect disappears.
What to do: Analyze behavior over time and check whether the effect is stable during the test.
Multiple testing and p‑hacking
Running dozens of tests at once or analyzing many metrics without adjustments increases the chance of finding “significant” results by luck.
What to do:
- Predefine the hypothesis, metric and variant being tested.
- Apply corrections for multiple comparisons (e.g. Bonferroni).
- Avoid searching for a positive result after the fact—this is p‑hacking.
Group contamination and sample ratio mismatch
The same user appears in both the control and test groups (for example, via different devices or due to technical bugs), or the observed split deviates from the planned one (sample ratio mismatch). Both problems distort data and blur the effect.
What to do:
- Tie each user rigidly to one version (using a user ID, cookie, etc.).
- Check the user ratio in the groups; it should be close to the assigned distribution (e.g. 50/50).
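Checking the observed ratio against the planned split is a simple binomial test; a sketch using the normal approximation (function name and numbers are illustrative):

```python
from math import sqrt
from statistics import NormalDist

def srm_p_value(n_a: int, n_b: int, expected_share_a: float = 0.5) -> float:
    """Two-sided p-value for a sample ratio mismatch: is the observed
    split compatible with the planned split? Uses the normal
    approximation to the binomial distribution."""
    n = n_a + n_b
    se = sqrt(n * expected_share_a * (1 - expected_share_a))
    z = (n_a - n * expected_share_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# A healthy 50/50 split vs. a suspicious one:
print(srm_p_value(10_050, 9_950))   # small wobble: high p-value, no alarm
print(srm_p_value(10_600, 9_400))   # large skew: tiny p-value, investigate
```

A very low p‑value here means the assignment mechanism itself is broken, so any metric comparison from that experiment should be treated as untrustworthy.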
Influence of hidden variables and seasonality
External factors such as holidays, marketing campaigns or news events can distort results and lead you to attribute an effect to the change.
What to do:
- Plan tests outside unstable periods.
- Mark important external events in your analysis calendar.
- Use segment analysis and additional controls for potential variables.