A team we know well set out to improve customer retention in an online service. Their plan was to build an ML system that could predict which customers were likely to churn before their subscription renewal came up, and then offer those users a timely discount to encourage them to stay.

While training the model, the team focused primarily on prediction accuracy: how often the model correctly predicted whether a customer would churn before renewal. Their key business metric was the overall customer retention rate.

The first iteration looked like a success. They deployed the system, started targeting predicted churners with discounts, and watched the overall retention numbers rise. A win, right?

Not quite. When the team reviewed the financial impact a few months later, the picture changed. Retention was indeed higher, but they were probably losing revenue on the intervention. There were two main reasons:

1. False positives were expensive.

The model produced a substantial number of false positives—customers flagged as likely to churn even though they probably would have stayed. These users were receiving unnecessary discounts, directly cutting into margins.

2. Intervention wasn’t always effective.

Some customers predicted to churn still left even after receiving a discount. In these cases, not only was the future subscription revenue lost, but the cost of the discount was also incurred with no benefit.

By focusing on overall prediction accuracy and the single business metric of retention, the team had missed these downstream costs. They were optimizing for the wrong goal—or at least, not the full one.
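To make those downstream costs concrete, here is a minimal back-of-the-envelope sketch. Every number in it (subscription value, discount cost, the split of flagged customers) is an illustrative assumption rather than the team's actual data; the point is only that a model with respectable accuracy can still drive a money-losing campaign.

```python
# Back-of-the-envelope economics of the discount campaign.
# Every number here is an illustrative assumption, not the team's real data.

monthly_value = 30.0    # assumed revenue per subscriber per month
discount_cost = 60.0    # assumed cost of the retention offer per targeted customer
horizon = 12            # months over which retained revenue is counted

targeted = 1_000        # customers the model flagged as likely churners
saved = 100             # would have churned, but stayed because of the offer
false_positives = 700   # would have stayed anyway (the offer was wasted)
lost_causes = targeted - saved - false_positives  # churned despite the offer

revenue_retained = saved * monthly_value * horizon
discount_spend = targeted * discount_cost  # every targeted customer gets the offer

net_impact = revenue_retained - discount_spend
print(f"Revenue retained: {revenue_retained:>10,.2f}")
print(f"Discount spend:   {discount_spend:>10,.2f}")
print(f"Net impact:       {net_impact:>10,.2f}")   # negative: the campaign loses money
```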

This pushed them back to the drawing board. Using data from the initial deployment, they categorized customers into three groups:

  • Loyal Stayers: Unlikely to churn. (Don’t offer discounts.)
  • Potential Churners (Retainable): Likely to churn but receptive to interventions. (Target these.)
  • Likely Churners (Lost Causes): Likely to churn regardless of intervention. (Discounts are wasted here.)
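A rough way to derive these segments from the first deployment's logs might look like the sketch below. The column names (predicted_churn, got_discount, churned) are hypothetical, and in practice you would want a no-discount holdout group to separate genuinely retainable churners from flagged customers who would have stayed anyway.

```python
import pandas as pd

# Hypothetical log schema from the first deployment:
#   predicted_churn - model flagged the customer before renewal
#   got_discount    - customer received the retention offer
#   churned         - customer actually left at renewal

def label_segment(row: pd.Series) -> str:
    """Rough, assumption-laden segmentation of first-deployment outcomes."""
    if not row.predicted_churn:
        return "loyal_stayer"
    if row.got_discount and not row.churned:
        # Stayed after the offer: treated as retainable here, although without
        # a holdout group some of these would have stayed anyway.
        return "potential_churner_retainable"
    return "likely_churner_lost_cause"

logs = pd.DataFrame({
    "predicted_churn": [False, True, True, True],
    "got_discount":    [False, True, True, False],
    "churned":         [False, False, True, True],
})
logs["segment"] = logs.apply(label_segment, axis=1)
print(logs)
```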

Instead of focusing on overall accuracy, they shifted to measuring accuracy per class (essentially per-class recall), which proved easier to explain to business stakeholders than precision/recall jargon.
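Computed from a confusion matrix, accuracy per class is just the diagonal divided by each row's total. A small sketch with made-up labels (0 = loyal stayer, 1 = retainable churner, 2 = lost cause):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted segment labels, purely for illustration.
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 1, 0])
y_pred = np.array([0, 0, 1, 1, 1, 2, 0, 2, 2, 0])

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

# "Accuracy per class": of the customers that truly belong to each class,
# how many did the model get right? (This is per-class recall.)
per_class_accuracy = cm.diagonal() / cm.sum(axis=1)
for cls, acc in zip(["loyal", "retainable", "lost cause"], per_class_accuracy):
    print(f"{cls:>11}: {acc:.0%}")
```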

They also introduced a second key business metric: Net Revenue Impact of Intervention. The refined goal was to find the sweet spot where retention improved and the interventions generated positive net revenue over a 12-month window (“revenue from retained churners” minus “revenue lost from discounts to loyal customers”).
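The metric itself is simple enough to write down directly. In the sketch below, the function name, parameters, and default values (subscription price, discount cost, 12-month window) are illustrative assumptions, not the team's actual figures.

```python
def net_revenue_impact(
    retained_churners: int,       # churn-predicted customers who stayed because of the offer
    discounted_loyal: int,        # customers who got the offer but would have stayed anyway
    monthly_value: float = 30.0,  # assumed average subscription revenue per month
    discount_cost: float = 60.0,  # assumed cost of the retention offer per customer
    horizon_months: int = 12,     # evaluation window from the refined goal
) -> float:
    """Net Revenue Impact of Intervention: revenue kept from retained churners
    minus revenue given away as discounts to customers who would have stayed."""
    revenue_from_retained = retained_churners * monthly_value * horizon_months
    revenue_lost_to_discounts = discounted_loyal * discount_cost
    return revenue_from_retained - revenue_lost_to_discounts

# Example: 150 genuinely retained churners vs. 400 needlessly discounted loyal customers
print(net_revenue_impact(retained_churners=150, discounted_loyal=400))  # 30000.0
```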

After training a new version of the model with these revised objectives, the results aligned far better with the business reality. Overall retention dipped slightly compared to the naive version (because they stopped discounting loyal customers), but Net Revenue Impact improved significantly. Wasted discounts were drastically reduced.
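One common way to bake this kind of objective into training (not necessarily what the team did) is to weight each class by the business cost of getting it wrong, for example via scikit-learn's class_weight. The weights and placeholder data below are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))     # placeholder features (tenure, usage, support tickets, ...)
y = rng.integers(0, 3, size=300)  # placeholder labels: 0 = loyal, 1 = retainable, 2 = lost cause

# Assumed relative misclassification costs, not the team's actual numbers:
# mislabeling a loyal customer triggers a wasted discount, missing a retainable
# churner forfeits a year of revenue, and errors on lost causes are cheapest.
cost_weights = {0: 2.0, 1: 3.0, 2: 1.0}

model = LogisticRegression(class_weight=cost_weights, max_iter=1000).fit(X, y)
print(model.predict(X[:5]))
```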

The major lesson the team walked away with: technical metrics like accuracy can be dangerously misleading when they aren’t tied to the full picture of business value, including costs and downstream effects. Sometimes, a model that looks “worse” on one metric is substantially better when you measure what truly matters.

(They eventually went even further and explored uplift modeling—but that’s a story for another time.)

To enhance your skills in working on AI/ML products, you can benefit from: