Many teams believe they know their product and users well, yet making decisions based on intuition or “expert opinions” does not protect them from mistakes. Relying on facts significantly reduces this risk.
Product development resembles a scientific study: you formulate a hypothesis, run an experiment and draw conclusions from the data. In digital products, the A/B test is the key way to validate hypotheses.
This comprehensive guide covers the essentials of A/B tests: why they are necessary, how to run them step by step, and common pitfalls to avoid along the way.
Contents
- Definition of an A/B test
- Why A/B tests matter in product work
- Other types of tests
- How to run A/B tests: a step‑by‑step guide
- Examples of product improvements through A/B testing
- Experiments where you make your product worse—a fast way to validate hypotheses
- The peeking problem in A/B tests
- Mistakes and pitfalls in A/B testing
Definition of an A/B test
In product management an A/B test is a type of experiment that lets you measure the impact of a change by comparing two versions of a product.
Users are randomly divided into two groups: the control group (A) sees the current version, while the test group (B) sees the modified version. After running the experiment, you analyze predefined key metrics to understand how the change affected user behavior.
How A/B testing evolved in tech
The roots of A/B testing lie in scientific and medical practice. The first known controlled experiment was conducted in 1747 by the Scottish surgeon James Lind, who divided sick sailors into groups and prescribed different treatments. Ultimately, only the group receiving citrus fruit improved. But widespread use of randomized controlled trials began in the 20th century, when the British statistician Ronald Fisher articulated their core principles. In 1948 the first trial of this kind measured the effectiveness of streptomycin against tuberculosis.
Over time, controlled experiments moved beyond science. By the 1950s marketers were using them. With the rise of the internet the method found its way into digital products: first in e‑commerce and online services, then across the broader tech industry. In the early 2000s Google and Microsoft started using A/B testing at scale to improve user experience and revenue.
Why A/B tests matter in product work
You can make product decisions without A/B tests—relying instead on intuition, expert judgement, market trends, competitor successes or stakeholder opinions. Sometimes this works, but even then you may not know why it worked.
A/B tests solve that problem. They allow you to isolate a change and understand whether it actually caused the observed effect. Instead of altering the entire product, you can experiment on a small part of it and attribute any impact to that specific change. In short, A/B testing helps you act on facts and replace guesswork with evidence‑based decisions.
What A/B tests give to the product team and the business
- Reduce risk and error: Validate hypotheses on a small sample before a full rollout, minimizing the likelihood of harming the business.
- Optimise the user experience: Identify which changes make the product more convenient (e.g., design tweaks, buttons, forms, content).
- Find working solutions to grow metrics: Increase conversion, average order value, retention and other key indicators (KPIs) based on confirmed data.
- Improve the product iteratively: The cumulative effect of continuous, small improvements produces a significant positive impact over time.
- Save time and resources: Instead of investing heavily in a feature that may not “take off,” test the idea first. This is especially valuable when budgets are tight and teams are overloaded.
- Build product culture: Teams that run experiments regularly learn to think in hypotheses, justify decisions, and not fear mistakes—because they have a safe framework to test ideas.
Companies like Google, Amazon, Netflix and Booking.com run thousands of A/B experiments a year, testing even tiny changes. For them experiments are an integral part of product culture.
Other types of tests
A/B testing is the most common method, but there are other ways to conduct experiments:
- A/A test: Both groups receive identical versions of the product to verify that the testing infrastructure works correctly. If significant differences appear between two identical variants, it may indicate data tracking issues, problems in group assignment (often called Sample Ratio Mismatch), or other errors. Running an A/A test helps surface and fix such issues before you start the real A/B testing.
- A/A/B test: A variant of the A/B test with two control groups (A1 and A2) and one test group (B). This lets you check that the testing setup is consistent: if the two control groups show notable differences during the experiment, you know something is broken.
- Switchback experiments: The product versions alternate over time windows for an entire population or location. For example, a given city sees version A of the product on one day and version B the next. This approach is used when users can influence one another (as in social products or network‑effect platforms) and helps minimize cross‑group interference.
- Multivariate testing (MVT): Multiple elements are changed simultaneously (e.g. the button colour and the text) to find the best combination. MVT requires a large user base because the number of possible combinations multiplies with each added element, meaning each variant receives a smaller slice of traffic.
In this guide we focus on A/B tests. Once you understand the classic A/B test, learning other experiment types becomes much easier.
How to run A/B tests: a step‑by‑step guide
Like scientific experiments, A/B tests require clear hypotheses, thoughtful design, careful execution and proper interpretation of results. Before we dive into the steps, let’s define some key concepts used in the process.
Control and experimental groups
In A/B testing users are divided into two groups:
- Control group (A): receives the current version of the product or feature.
- Test group (B): receives the modified version that needs evaluation.
Comparing results across the groups shows the impact of the change.
Important points:
- Participants in each group should not overlap. They must belong to the same segment and be solving the same problem in the product to ensure clean, unbiased results.
- Users should be randomly assigned to the control and experimental groups.
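In practice, random assignment is usually implemented as deterministic hashing rather than a live coin flip, so the same user always lands in the same group across sessions and devices. A minimal sketch (function and experiment names are illustrative):

```python
import hashlib

def assign_group(user_id: str, experiment: str, split: float = 0.5) -> str:
    """Deterministically bucket a user into group 'A' or 'B'.

    Hashing the user ID together with the experiment name keeps the
    assignment stable for each user and independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    # Map the first 8 hex digits to a roughly uniform value in [0, 1].
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return "A" if bucket < split else "B"

# The same user always gets the same variant:
assert assign_group("user42", "checkout_test") == assign_group("user42", "checkout_test")
```

Because the hash is uniform, large populations split close to the configured ratio without any shared state between servers.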
Defining success metrics
Before starting the test you must specify which indicators will be measured to gauge the effectiveness of the change. Common examples include:
- Conversion: Percentage of users who complete a target action (e.g., a purchase).
- Engagement: Time spent on the site, number of pages viewed.
- Retention: Share of users who return to the product after a given period.
These are just examples. Your success metric should tie back to your specific hypothesis.
P‑value and statistical significance
The p‑value is a statistical indicator that helps you decide whether differences between groups are random or meaningful. A low p‑value (typically below 0.05) suggests that the differences are statistically significant and unlikely to be due to random chance.
Step 1: Define the subject and goals of the test
You can analyze almost any metric, as long as it is relevant to the task. For recommendation tests you might look at click‑through rates on items; in a mobile banking experiment you might measure the time to complete a target action. In media products you might track the percentage of articles read to the end, and so on.
Important: In A/B tests, you should change only one parameter at a time so you can pinpoint what affected the result. The user experience for both groups must remain identical in all other respects.
Before launching an A/B test, formulate the hypothesis clearly using the pattern:
If we change [specific element], it will improve [metric] by X %.
For example:
- Too vague: “Update the payment page design.”
- Concrete and measurable: “If we change the ‘Buy’ button colour from grey to orange, conversion will increase by 5 %.”
What metrics can you influence through an A/B test?
- Conversion (e.g., percentage of users who place an order).
- Clicks (on a button, link or ad banner).
- Engagement (time on site, number of pages viewed).
- Retention (how many people return after N days).
These illustrate the kind of metrics linked to the hypothesis.
Step 2: Choose a tool
Running an A/B test requires specialized tools. Your choice depends on factors such as the team’s technical capabilities, platform (mobile vs. web), budget, company size, testing goals, and integration with existing systems. Below is a selection of common solutions (note that this list is not exhaustive).
For web pages
| Name | Website | Comment |
| --- | --- | --- |
| Optimizely | optimizely.com | One of the most powerful tools for client‑side and server‑side A/B tests. |
| VWO | vwo.com | Offers a visual editor, analytics, heat maps and more in one interface. |
| Zoho PageSense | zoho.com/pagesense/ | An affordable solution for small and medium‑sized businesses with wide functionality. |
| Crazy Egg | crazyegg.com | Suitable for small businesses; includes heat maps and simple tests. |
| Unbounce | unbounce.com | Code‑free landing‑page creation and testing—great for marketers. |
| Convertize | convertize.com | Intuitive visual editor; convenient for marketing teams. |
| Adobe Target | adobe.com/target | Enterprise‑level solution with advanced personalisation and AI optimisation. |
| Kameleoon | kameleoon.com | Advanced A/B and multivariate testing for large projects. |
| SiteSpect | sitespect.com | Testing without JavaScript—suited for high‑traffic sites; server‑side access. |
For mobile applications
| Name | Website | Comment |
| --- | --- | --- |
| Firebase A/B Testing | firebase.google.com | Free tool from Google, easily integrates with other Firebase services. |
| Airship | airship.com | Supports both visual and server‑side tests, including feature flags. |
| UXCam | uxcam.com | Provides visual analytics, heat maps, session recordings and A/B experiments. |
| Harness | harness.io | Server‑side testing focused on engineering teams. |
| LaunchDarkly | launchdarkly.com | Feature‑flag‑based experiments, suited for large mobile products. |
| CleverTap | clevertap.com | Combines A/B tests with marketing communications, including push notifications. |
| Dynatrace | dynatrace.com | Full growth platform: A/B tests, analytics, feature flags. |
| Mixpanel | mixpanel.com | Primarily behavioral analytics, but includes basic experimentation tools. |
| Kameleoon Mobile | kameleoon.com | Native‑app solution fully integrated with the main Kameleoon platform. |
Many of these tools offer free trials so you can evaluate their features before purchasing. Be sure to check each service’s terms—some require a subscription to access results. If you join an established product team, existing tools are likely already in place, so selecting a tool is more relevant for new products or startups.
Step 3: Design the experiment
Once you have defined the hypothesis and metrics, move on to the technical design.
- Define the test parameters
- Minimum Detectable Effect (MDE): The smallest difference between the groups that would matter in practice; the test is sized so it can detect it.
- Significance level (α) and Test power (1 – β): Determine your risk tolerance. The significance level defines how much error you can tolerate if you mistake random noise for a real effect (a false positive). Test power dictates how confident you want to be in successfully detecting a real effect when one actually exists (avoiding a false negative).
- Choose a statistical test: Select the appropriate mathematical method for your data (for example, using a t-test to compare means).
- Calculate sample size: Determine how many users are needed to detect a difference if one exists. A highly recommended industry tool for this is Evan Miller’s Sample Size Calculator.
- Set up randomization: Ensure users are randomly assigned to control and test groups, typically 50/50, to avoid bias.
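The sample‑size step above can be sketched with the standard two‑proportion formula; this mirrors what tools like Evan Miller's calculator compute (the function name and default values are illustrative):

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_base: float, mde: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per group to detect an absolute lift
    `mde` over a baseline conversion `p_base` (two-sided test)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # quantile for the significance level
    z_beta = z.inv_cdf(power)            # quantile for the desired power
    p_new = p_base + mde
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    n = ((z_alpha + z_beta) ** 2 * variance) / mde ** 2
    return math.ceil(n)

# Detecting a lift from 5% to 6% conversion at alpha=0.05, power=0.8:
print(sample_size_per_group(0.05, 0.01))  # roughly eight thousand users per group
```

Note how quickly the requirement grows as the MDE shrinks: halving the detectable effect roughly quadruples the sample size.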
Step 4: Run the experiment – important process aspects
Commit to a fixed testing period: Decide how long the experiment will last based on the sample‑size calculation before you launch. Avoid changing the test duration mid‑experiment to prevent bias.
How long should you run the test?
Test duration depends on several factors:
- Sample size requirements: To get a reliable result, you must collect enough data. How much depends on the difference you want to detect and the desired accuracy.
- Seasonality and activity cycles: The test period should cover a full cycle of user activity (e.g., a week) to smooth out daily and weekly fluctuations (e.g., weekend vs. weekday traffic).
- Platform recommendations: As a general baseline, platforms like Facebook suggest running tests for at least seven days but no longer than 30 days to gather sufficient information.
Beware the “peeking problem”: Peeking means prematurely analyzing interim results before the test period is finished. Doing this inflates the risk of a Type I error (false positive) and leads to wrong decisions. We will discuss the peeking problem in detail later.
Monitor technical metrics: Ensure that both product variants work correctly so that technical issues do not skew results. Confirm that users are evenly split between control and test groups.
Step 5: Analyse the results
Check statistical significance
Not every observed difference between variants is meaningful. Sometimes differences arise by chance rather than true effect. You must evaluate statistical significance to ensure the effect is linked to the change.
Many online tools make it easy to assess the statistical significance of A/B tests, such as SurveyMonkey’s A/B testing calculator for statistical significance. Many A/B testing platforms also provide built-in significance calculators.
For those who want to understand the math behind the process:
If you want to dive deeper into the statistical methods used in analysing A/B tests, it is crucial to understand the p‑value.
A p‑value is a statistic that helps determine how well the experimental results align with the null hypothesis. The null hypothesis assumes there is no effect or difference between the groups being compared. The p‑value shows the likelihood of obtaining the observed results, or more extreme ones, if the null hypothesis is true.
The formula for calculating a p‑value depends on the chosen statistical test and the data distribution. In general, it is computed as the area under the probability density curve of the test statistic, starting from the observed value and continuing into the more extreme tail. The larger this area, the higher the p‑value and the less significant the test results.
A common threshold is p = 0.05. This means that if the p‑value is below 0.05, the differences are considered statistically significant.
Suppose we test a new version of a web page and want to know if it affects conversion compared with the current version. The null hypothesis states that there is no difference. After the test, we get a p‑value of 0.03, meaning there is a 3 % chance of seeing these or more extreme results assuming there is no real effect.
Key points to remember:
- The p‑value does not indicate the size of the effect; it only tells you how unlikely the observed data would be if there were no real effect.
- A low p‑value suggests the results are unlikely to have occurred by chance, but it doesn’t guarantee that the effect is practically meaningful for the business.
- A high p‑value indicates that the observed differences could have occurred randomly, but it does not prove that there is no effect (it simply fails to prove there is one).
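As an illustration of the math above, a pooled two‑proportion z‑test computes the p‑value from nothing more than the conversion counts. This is a sketch for intuition; in practice, use your testing platform's built‑in analysis or a vetted statistics library:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a: int, n_a: int,
                           conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion
    rates, using the pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)       # rate under the null hypothesis
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# 5% conversion in control vs 6% in test, 10,000 users each:
print(two_proportion_p_value(500, 10_000, 600, 10_000))  # well below 0.05
```

With these illustrative numbers the difference is statistically significant; with far fewer users the same 1‑point lift would not be.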
Account for external factors
Consider outside influences such as seasonality, marketing campaigns, or news events. When the test uses two equal cohorts under identical conditions, external factors should impact both groups equally. However, you must still be aware of parallel product or marketing changes during the experiment, as they could affect user behavior and distort the result.
Confidence intervals
Confidence intervals (CI) indicate the range of plausible values for the true metric, typically calculated at the 95 % level. For example, if an A/B test yields a conversion rate of 5 % ± 1 % (95 % CI: 4–6 %), the data are consistent with a true conversion rate anywhere in that range; across many repeated experiments, 95 % of such intervals would contain the true value.
Note that if the confidence interval for the difference between variants does not include zero, the effect is statistically significant.
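A sketch of the corresponding calculation for the difference between variants, using a simple Wald interval (function name and numbers are illustrative):

```python
from math import sqrt
from statistics import NormalDist

def diff_confidence_interval(conv_a: int, n_a: int,
                             conv_b: int, n_b: int,
                             level: float = 0.95) -> tuple[float, float]:
    """Wald confidence interval for the difference in conversion
    rates (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    z = NormalDist().inv_cdf(0.5 + level / 2)
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# 5% vs 6% conversion with 10,000 users per group:
low, high = diff_confidence_interval(500, 10_000, 600, 10_000)
print(f"95% CI for the lift: [{low:.4f}, {high:.4f}]")  # does not include zero
```

Because the interval excludes zero, this (illustrative) result would count as statistically significant, matching the rule stated above.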
Dealing with noise and false positives
Noise (random fluctuations or errors unrelated to the change) and false positives can distort the interpretation of a test. To mitigate these:
- Correct for multiple comparisons: Running many tests or tracking many metrics increases the risk of false positives. Statistical adjustments like the Bonferroni correction help control this risk.
- Run A/A tests: Testing two identical variants helps gauge baseline noise and false‑positive rates.
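The Bonferroni correction mentioned above is simple enough to show directly: with m comparisons, each p‑value is tested against α/m instead of α (illustrative sketch):

```python
def bonferroni_significant(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Flag which of several p-values remain significant after the
    Bonferroni correction: each is compared against alpha / m."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Three metrics tracked in one experiment; only the first survives the correction:
print(bonferroni_significant([0.01, 0.04, 0.20]))
```

Note that 0.04 would pass an uncorrected 0.05 threshold but fails the corrected one (0.05 / 3 ≈ 0.017), which is exactly the false positive the correction guards against.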
Step 6: Make decisions
The results of A/B tests provide valuable input for decisions, but they are not orders to act. Even when a statistically significant positive effect is found, consider the context and other factors before rolling out the change.
Possible post-test actions include:
- Analyze and adjust the hypothesis: If results don’t meet expectations, revisit assumptions and explore alternative solutions.
- Decline to roll out: If the potential benefit does not justify the costs or risks, it may be wise not to implement the change.
- Execute a limited rollout: Even after achieving positive results, it can be useful to introduce the change to a small share of users first to see whether it holds up in real‑world conditions.
- Analyze side effects: A change could improve the main metric but hurt others. Before rollout, check that secondary metrics and user experience are not adversely affected.
Examples of product improvements through A/B testing
Case 1: Optimising Netflix cover art
Problem: Netflix found that users often decide what to watch based on cover art. Standard covers did not always attract attention or motivate viewing.
Hypothesis: Personalizing covers based on user preferences would increase engagement and total time spent in the product by 5%.
Testing: Netflix built a system that automatically selected the cover most aligned with each user’s interests—for example, showing comedy‑themed covers to users who frequently watched comedies.
Result: The personalized covers significantly increased clicks and viewing time.
Case 2: Improving Amazon recommendations
Problem: Amazon’s recommendation system sometimes failed to suggest relevant products, reducing conversion and customer satisfaction.
Hypothesis: Implementing a new machine‑learning model for personalised recommendations would improve accuracy and boost sales.
Testing: Amazon ran an A/B test comparing the current recommendation system with the new model. Users were randomly split: the control group received the old recommendations, while the test group received the new ones.
Result: The new model increased click‑through on recommendations and conversion. Amazon rolled out the updated personalisation system to all users.
Case 3: Optimising Spotify’s landing page
Problem: Spotify aimed to increase the number of premium subscriptions by providing highly relevant content for users coming from specific search queries.
Hypothesis: Creating customized landing pages tailored to users’ interests would increase conversion to premium.
Testing: Spotify ran an experiment targeting users in Germany who searched for “audiobooks” and clicked on an ad. Half were directed to the standard generic page, and the other half to a specially designed page focused on audiobooks.
Result: The customized page increased premium subscriptions by 24%.
Experiments where you make your product worse—a fast way to validate hypotheses
Experiments where you make your product worse are an effective but seldom‑used tool. Product teams sometimes consider a factor important without objective evidence. For example:
- “We need to speed up the app, otherwise users will leave.”
- “If we stop sending push notifications, retention will drop.”
- “Fast support response is key to user satisfaction.”
Instead of investing resources in improving something (which may be costly), a degradation test lets you check how much that factor actually affects key metrics.
Principle: Intentionally make an aspect of the product worse and measure its impact on user behavior.
Scenario: A company wants to invest in accelerating its app because it assumes faster load times will increase retention and conversion.
A/B test: Instead of immediately spending money on optimization, the team slows down the app (for example by 1, 2 and 3 seconds) and observes how key metrics change.
Possible result: The key metrics do not change, meaning that investing resources in speeding up the app is not a priority at this stage.
Businesses worry that making the product worse will drive users away or provoke negative reviews. But if a degradation test is designed properly, it affects only a small share of the audience and runs for a limited time.
The peeking problem in A/B testing
What is the peeking problem?
One of the most common and dangerous mistakes in A/B testing is peeking: reviewing interim results too soon. Teams may decide to stop a test when they see a significant difference, without waiting for the required sample size to be reached.
If you stop the test at the first sign of significance, you risk making a false conclusion.
Why does peeking distort results?
The phenomenon of random fluctuations
If a test shows a significant difference very quickly, it is likely to be a fluke. Even when there is no real difference between variants A and B, metrics will still oscillate by chance. Over the course of a long test, the difference between the groups will occasionally cross the threshold of statistical significance even if there is no true effect.
Thus, if you check the results too frequently, standard statistical methods start producing false positives.
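The inflation is easy to demonstrate with a small simulation: both variants are identical, so every "significant" result is by definition a false positive. Checking ten times during the test flags far more false winners than a single pre‑planned check (an illustrative sketch; exact rates vary with the seed and settings):

```python
import random
from math import sqrt
from statistics import NormalDist

def p_value(success_a: int, success_b: int, n: int) -> float:
    """Pooled two-proportion z-test, two-sided, equal group sizes."""
    p_pool = (success_a + success_b) / (2 * n)
    if p_pool in (0.0, 1.0):
        return 1.0
    se = sqrt(p_pool * (1 - p_pool) * 2 / n)
    z = (success_b / n - success_a / n) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(7)
TRUE_RATE = 0.05          # both variants identical: any "win" is a false positive
SIMS, N, LOOKS = 400, 2000, 10

peek_fp = final_fp = 0
for _ in range(SIMS):
    a = [random.random() < TRUE_RATE for _ in range(N)]
    b = [random.random() < TRUE_RATE for _ in range(N)]
    # Peeking: check after every N // LOOKS users, stop at the first "significant" look.
    if any(p_value(sum(a[:k]), sum(b[:k]), k) < 0.05
           for k in range(N // LOOKS, N + 1, N // LOOKS)):
        peek_fp += 1
    # Discipline: a single pre-planned check at the end.
    if p_value(sum(a), sum(b), N) < 0.05:
        final_fp += 1

print(f"false positives with peeking:        {peek_fp / SIMS:.1%}")
print(f"false positives, single final check: {final_fp / SIMS:.1%}")
```

The single final check stays near the nominal 5 % error rate, while repeated peeking typically pushes it several times higher.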
How can you avoid this error?
- Fix the sample size in advance: Before starting, calculate how many users you need to detect a meaningful difference. For example, if you need 100,000 users, don’t analyze the test until that amount of data is collected.
- Use sequential testing: This method allows you to adapt the test dynamically without inflating the p‑value. Sequential testing lets you analyze data as it comes in and decide to finish the experiment earlier than planned if the results become statistically significant. Google and Optimizely use Sequential Experiment Design, which adjusts statistical calculations to account for frequent checks.
- Use a Bayesian approach.
In statistics there are two main ways to evaluate probabilities: frequentist and Bayesian.
- The frequentist method estimates probability based on the frequency of events across many repetitions. For example, if a coin lands heads 50 times out of 100 throws, you might say the probability of heads is 50%.
- The Bayesian method treats probability as a degree of belief that is updated as new information arrives. Suppose you know most of the candies in a box are chocolate, so if you pick one at random you expect it to be chocolate. If you later learn that caramel candies are also in the box, you update your confidence in picking a chocolate.
- In A/B testing, Bayesian methods allow you to factor in uncertainty and update the probability of a variant’s success in real time as new data arrives. Unlike the frequentist approach, Bayesian A/B tests explicitly model uncertainty and adjust the estimated probability of a winning variant as more data becomes available. However, Bayesian methods are also susceptible to peeking if not used properly.
- Don’t make a decision on the first sign of significance.
If the difference becomes significant, keep the test running until the planned end. If the difference remains stable for several days, you can then draw more confident conclusions.
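A minimal sketch of the Bayesian approach for conversion data: with Beta priors, the posterior probability that variant B beats variant A can be estimated by sampling from the two posteriors (priors, function name, and numbers are illustrative):

```python
import random

def prob_b_beats_a(conv_a: int, n_a: int, conv_b: int, n_b: int,
                   draws: int = 20_000, seed: int = 0) -> float:
    """Monte-Carlo estimate of P(rate_B > rate_A) under independent
    uniform Beta(1, 1) priors (the Beta-Binomial model)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each variant: Beta(1 + successes, 1 + failures).
        sample_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        sample_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += sample_b > sample_a
    return wins / draws

# 5% conversion in A vs 6% in B, 10,000 users each:
print(prob_b_beats_a(500, 10_000, 600, 10_000))  # close to 1: B is very likely better
```

The output reads naturally as "the probability that B is better", which many teams find easier to communicate than a p‑value, though (as noted above) it does not by itself make peeking safe.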
Mistakes and pitfalls in A/B testing
Prematurely ending the test
The team sees a promising difference and rushes to stop the experiment before enough data is collected.
Risk: High chance of a false positive. What looks like success on day 3 may look very different on day 10.
What to do: Define the sample size and test duration in advance and avoid acting before the test ends.
Lack of a clear hypothesis
The test is launched without a clear understanding of what is being tested and what outcome counts as success.
What to do: Write the hypothesis down before launch using the pattern from Step 1 (“If we change [element], [metric] will improve by X %”) and agree on the success metric in advance.
Ignoring the novelty effect
A short‑term spike in interest in a new element is mistaken for sustained growth. After the test ends, the metric “reverts” and the effect disappears.
What to do: Analyze behavior over time and check whether the effect is stable during the test.
Multiple testing and p‑hacking
Running dozens of tests at once or analyzing many metrics without adjustments increases the chance of finding “significant” results by luck.
What to do:
- Predefine the hypothesis, metric and variant being tested.
- Apply corrections for multiple comparisons (e.g. Bonferroni).
- Avoid searching for a positive result after the fact—this is p‑hacking.
Group contamination and sample ratio mismatch
The same user appears in both the control and test groups (for example, via different devices or due to technical bugs), or the observed split deviates from the planned one (sample ratio mismatch). Both problems distort data and blur the effect.
What to do:
- Tie each user rigidly to one version (using a user ID, cookie, etc.).
- Check the user ratio in the groups; it should be close to the assigned distribution (e.g. 50/50).
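Checking the observed ratio against the planned split is a simple binomial test; a sketch using the normal approximation (function name and numbers are illustrative):

```python
from math import sqrt
from statistics import NormalDist

def srm_p_value(n_a: int, n_b: int, expected_share_a: float = 0.5) -> float:
    """Two-sided p-value for a sample ratio mismatch: is the observed
    split compatible with the planned split? Uses the normal
    approximation to the binomial distribution."""
    n = n_a + n_b
    se = sqrt(n * expected_share_a * (1 - expected_share_a))
    z = (n_a - n * expected_share_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# A healthy 50/50 split vs. a suspicious one:
print(srm_p_value(10_050, 9_950))   # small wobble: high p-value, no alarm
print(srm_p_value(10_600, 9_400))   # large skew: tiny p-value, investigate
```

A very low p‑value here means the assignment mechanism itself is broken, so any metric comparison from that experiment should be treated as untrustworthy.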
Influence of hidden variables and seasonality
External factors such as holidays, marketing campaigns or news events can distort results and lead you to attribute an effect to the change.
What to do:
- Plan tests outside unstable periods.
- Mark important external events in your analysis calendar.
- Use segment analysis and additional controls for potential variables.