Let’s start with a practical task.

Say a company’s management wants to allocate significant resources to the development of infrastructure that would increase their app’s speed. The hypothesis is that increasing the speed of the app will have a positive effect on the user experience and the key metrics.

Think of an experiment (an A/B test) to validate this hypothesis.


Experiments where you make your product worse

The above-mentioned example is one of the challenges that Simulator students face. Almost 70% of the students suggest speeding up the app and then measuring the effect using an A/B test.

Here’s why this approach is problematic: optimizing the app’s speed is a costly process, and the whole point of the experiment is to decide whether the effect of a faster app justifies the investment in designing and implementing the necessary components and modifications. But if you have to implement the speed upgrade (and pay its cost) just to conduct the experiment, then what’s the point of the experiment in the first place?

A quick and inexpensive alternative is an experiment where you make your product worse. We slow down the test version of the product (which is usually much simpler than implementing the speed optimization) and check whether doing so has a negative impact on the key product metrics. If the impact is negative, we can conclude that the app’s speed affects the key metrics and that allocating resources to accelerate the product is justified. If there is no measurable impact, we can leave things as they are and avoid paying for a modification that would not deliver a substantial return on investment.
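To make the mechanics concrete, here is a minimal sketch, in Python, of what the server side of such a slowdown experiment might look like. Everything specific in it is an assumption made for illustration: the experiment name, the 5% exposure share, the 500 ms delay, and the `process` placeholder. In a real product, group assignment would normally come from your experimentation platform rather than a hand-rolled hash.

```python
import hashlib
import time

EXPERIMENT = "app_speed_slowdown_v1"  # hypothetical experiment name
TEST_GROUP_SHARE = 0.05               # expose only 5% of users to the degraded experience
SLOWDOWN_SECONDS = 0.5                # hypothetical artificial delay (500 ms)


def in_slowdown_group(user_id: str) -> bool:
    """Deterministically assign a user to the slowdown group by hashing their id."""
    digest = hashlib.sha256(f"{EXPERIMENT}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map the hash to [0, 1]
    return bucket < TEST_GROUP_SHARE


def process(payload: dict) -> dict:
    """Placeholder for the real request handling."""
    return {"ok": True, **payload}


def handle_request(user_id: str, payload: dict) -> dict:
    """Serve a request, adding artificial latency for users in the test group."""
    if in_slowdown_group(user_id):
        time.sleep(SLOWDOWN_SECONDS)  # deliberately degrade response time
    return process(payload)
```

Hashing the user id together with the experiment name keeps the assignment stable, so the same user gets the same (slowed or normal) experience on every request.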

In reality, things are a bit more complicated: the dependency is not always linear. For example, the speed of a product may already be so low that making it even slower won’t affect the metrics in any way. This doesn’t invalidate the approach; it simply defines the limits of its applicability, and we must keep it in mind when designing these experiments and interpreting their results.

There’s a real-world example of this approach: The Financial Times once ran exactly this kind of experiment, and you can read about it here. From personal experience, I can say that the approach is widely used at Facebook and other big tech companies.

Only 30% of Simulator students suggested slowing down the app (it is worth noting that they weren’t given a hint by the article’s title). In real life, the percentage of people who would dare to make such a suggestion is much smaller, and the percentage who would actually do it is probably close to zero.

In companies with a poor culture of experimentation and working with data, the idea of deliberately making the product worse will most likely meet resistance. Notably, even in companies where data has already become an important part of the product development cycle, proposing this type of experiment may well bewilder the product team.

Three stages of A/B testing evolution in companies

For most companies, A/B testing is a theoretical concept that has nothing to do with real work. Some companies use experiments to study the effect of new features and product changes on key metrics. However, only a few of them use experiments to determine what direction the product’s development should take.

Experiments make it possible to understand which levers influence the product and which don’t. They also measure how great the potential of each of the levers is. Without this information, prioritizing resources and projects looks more like making random guesses than a rational, calculated process.

Experiments where you make the product worse are an effective tool for reducing this uncertainty: they let you evaluate, up front, the potential impact of a particular product component on key metrics.

Experiments where you make your product worse as a prioritization tool

Company X believes that push notifications are the driver of its app’s retention. For most companies, this simple assumption would be enough to assemble a dedicated team in charge of push notifications. But the management of Company X wants to make sure this is the best possible use of its resources, so it decides to run an experiment: disable push notifications for some users and measure the impact of push notifications on the key product metrics.
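A minimal sketch of how such a holdback could be gated in code, assuming the same deterministic hash-based assignment as in the slowdown example; the experiment name, the 10% holdback share, and the print statements standing in for exposure logging and real delivery are all illustrative.

```python
import hashlib

EXPERIMENT = "push_holdback_v1"  # hypothetical experiment name
HOLDBACK_SHARE = 0.10            # withhold push notifications from 10% of users


def in_holdback(user_id: str) -> bool:
    """Deterministically assign a user to the no-push (holdback) group."""
    digest = hashlib.sha256(f"{EXPERIMENT}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0xFFFFFFFF < HOLDBACK_SHARE


def maybe_send_push(user_id: str, message: str) -> None:
    """Send a push unless the user is held back; record the suppression for analysis."""
    if in_holdback(user_id):
        print(f"suppressed push for {user_id}: {message}")  # stand-in for exposure logging
        return
    print(f"sending push to {user_id}: {message}")  # stand-in for the real delivery path
```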

The results of this experiment provide the foundation for making the right decision: they demonstrate how much push notifications actually contribute to the app’s key metrics. With a reliable estimate of the effect a dedicated team would be trying to amplify, the company’s management can decide whether the investment is worth it.

This approach is applicable to a lot of hypotheses that, at first glance, seem like obvious ways to improve user metrics and experience, but in reality fail to do so. Here are a few examples:

Suggestion: Adding more levels will improve the game, because levels are the core concept of the game and hold everything else together.
How to verify: Remove some of the available game levels for half of new users and examine the effect on the key metrics.

Suggestion: If the users don’t get a quick response from the support team, then we risk losing them. Therefore, we need to reduce the response time, and to do this we may …
How to verify: Form several customer cohorts that receive replies from the support team at different speeds, then assess the effect of response time on key metrics (a cohort-assignment sketch follows this list).

Suggestion: We receive a lot of complaints regarding the quality of the search feature on our online store. We need to form a team that will be in charge of improving the search functionality.
How to verify: First, we remove the search feature for some of the users and check whether the key metrics are affected. Then we evaluate how much the team would realistically be able to improve the search, what benefits that would bring, how those benefits relate to the overall cost of the project, and what other options are on the table.
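To illustrate the cohort idea from the support response-time example above, here is a sketch of a deterministic multi-arm assignment. The arm names and target first-reply times are invented for the example; a real setup would also log the assignment so the analysis can join cohorts to metrics.

```python
import hashlib

EXPERIMENT = "support_response_time_v1"          # hypothetical experiment name
ARMS = {"fast": 5, "standard": 60, "slow": 240}  # target first-reply time in minutes (illustrative)


def assign_arm(user_id: str) -> str:
    """Deterministically map a user to one of the response-time cohorts."""
    digest = hashlib.sha256(f"{EXPERIMENT}:{user_id}".encode()).hexdigest()
    index = int(digest[:8], 16) % len(ARMS)
    return sorted(ARMS)[index]


# The support queue can then schedule the first reply according to the user's arm.
arm = assign_arm("user_42")
print(arm, ARMS[arm], "minutes")
```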

Experiments are the only way to check whether an observed correlation between a product modification and key metric changes is causal or not. Such experiments help us test in a quick and affordable fashion how deteriorating a certain parameter affects key metrics. That is why they work so well when it comes to prioritizing projects.
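Once a degrading experiment has run, the comparison itself is usually a straightforward two-sample test on the key metric. The sketch below compares day-7 retention between a control and a degraded cohort using a two-proportion z-test; the retention counts are made-up numbers for illustration only.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(success_a: int, total_a: int, success_b: int, total_b: int) -> float:
    """Return the two-sided p-value for a difference between two proportions."""
    p_a, p_b = success_a / total_a, success_b / total_b
    p_pool = (success_a + success_b) / (total_a + total_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Illustrative numbers: day-7 retention in the control vs. the slowed-down cohort.
p_value = two_proportion_z_test(success_a=4200, total_a=10000, success_b=3950, total_b=10000)
print(f"p-value: {p_value:.4f}")  # a small p-value suggests the degradation really moved the metric
```

In practice you would also estimate the required sample size and test duration before launching, but the core comparison is no more complicated than this.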

Cons of experiments where you make your product worse

Those who oppose this type of experiment argue that degrading the product spoils the user experience, increases churn, generates negative feedback, and hurts the brand’s reputation.

Based on personal experience, I would say that this concern is greatly exaggerated. You will be surprised at how often degrading tests will have no significant effect on the product (and this is a very important piece of knowledge).

But even if there is a negative impact, try looking at the issue from a different perspective. You have limited resources. One option is to distribute them between projects blindly (or almost blindly, based on your current hypotheses about the product). In that case, you have an unknown chance of improving the user experience, and you’ll have to spend several months implementing the new feature and evaluating its effect on the key metrics. If the bet doesn’t pay off, you will have wasted precious resources and time that could have been spent on more impactful things.

Alternatively, you can run a series of quick degrading experiments, spoiling the experience of a small percentage of the product’s users for a short period. In return, you reach a conclusion much faster and increase the chances of allocating your resources correctly, which ultimately makes the product better for all of its users, including those who will join in the future.

Unintentional experiments where you make your product worse

If my arguments did not convince you to use this type of experiment intentionally, you can still benefit from it in other ways.

All companies run degrading experiments. Most of them are unintentional and happen when the team accidentally breaks something in the product.

The next time this happens in your company, do your best to fix things quickly, but also study the impact of the accidental change on the metrics. Observing the damage done can help you discover a new lever of influence on the product.

The unintentional degrading experiment that influenced the product’s strategy

At API.AI (now Dialogflow), I was working on the smart voice assistant Assistant.ai.

It was a very challenging task with no easy solutions at hand. Most services are built around solving one specific problem, and they do their best to solve it well. But in the case of our smart assistant, the expectations of users were so diverse that it was nearly impossible to choose the focus and design the perfect onboarding experience for new users.

Our initial strategy was to identify the most popular features and skills that users expected in an assistant. We would then place these features at the center of our product and build the onboarding flow around these skills (quite a challenging task for a product where the main interface is vocal, not visual). This approach led to a number of small victories. But due to natural limitations, such as the duration of the first session, the voice interface, and a huge variety of usage patterns, it led to a dead-end.

But then help came from an unexpected place when one day, our key product metrics fell by 20%.

It took us a whole week to figure out what was going on. The fact that we had launched a new version of the product right before the drop only complicated things. As it turned out, the update had nothing to do with the drop in metrics. The cause lay in changes to the third-party voice recognition service from Google, which we used on Android for some of the languages in our product.

Google had started experimenting with the way its voice recognition service detected the end of speech by analyzing pauses. The metrics dropped when Google reduced the time it would wait for a person to continue talking. In reality, while speaking we often pause to think about what to say next. This is especially true when talking to a virtual assistant and asking it to do something (make a note, add a reminder, etc.). As a result, users often didn’t have time to finish their sentences.

The decline in the quality of one of the fundamental technologies supporting the Assistant led to a greater change in metrics than any of the changes we had made earlier.

Our understanding of the product model, and of what was and wasn’t important, changed dramatically after this unintentional degrading experiment. Before the incident, we had focused on what the Assistant could do and how it presented its skills to users. But it became obvious that the quality of the key technologies the product was built on (i.e., voice recognition and natural language understanding [NLU]) was a much more powerful lever for improving the user experience and making the product better.

This kind of knowledge is crucial when deciding how to prioritize scarce resources, and I obtained it as the result of an unintentional experiment that made the product worse. Who knows what revelations your experiments will bring?