When you’re training machine learning models in a lab, the goal is to optimize the model for a specific metric. When you’re building real-world ML applications, you must take a broader view of the application, the economics, the end users, and the market to make the right decision.
In this post, I’ll share a hard-earned lesson from my experience adjusting the economics of a machine learning product. It can help you avoid similar mistakes and get your own ML products working faster.
This post was written by Ben Dickson, a seasoned engineer, tech blogger, and mentor at our AI/ML Simulator for Product Managers.
We were helping a ride-hailing service reduce the percentage of negative reviews. According to the product team’s analysis, the main reason for negative reviews was driver distraction. So we developed a working prototype of an ML-powered driver distraction detection system and ran a pilot across a fleet of 700 vehicles in one city (see the footnote for a brief description of how the system worked).
We wanted to see if our product was economically viable. Here were our initial estimates:
– Average gross merchandise value (GMV) per driver per year = $60,000
– Service commission = 30%
– One-time cost of installing ML gear in a car = $200
– Annual costs of running the ML service (internet + server costs + driver bonus for reducing distraction) = $3,000
According to our estimates, every 1% reduction in negative reviews would increase GMV by 4%. Therefore, to break even on the cost of deploying the system within one year, we would need to reduce negative reviews by about 4.5%:
(One-time cost + annual costs) / (driver GMV × commission × GMV increase per 1% reduction in negative reviews)
= 3,200 / (60,000 × 0.3 × 0.04) ≈ 4.5%
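To make this arithmetic easy to replay, here is a minimal Python sketch of the break-even calculation; the function and parameter names are just illustrative, and the numbers are the estimates listed above.

```python
def breakeven_reduction_pct(one_time_cost, annual_costs, driver_gmv,
                            commission, gmv_lift_per_pct):
    """Percentage-point reduction in negative reviews needed for one
    driver's extra commission revenue to cover first-year costs."""
    first_year_cost = one_time_cost + annual_costs
    revenue_per_pct = driver_gmv * commission * gmv_lift_per_pct
    return first_year_cost / revenue_per_pct

print(breakeven_reduction_pct(
    one_time_cost=200,      # ML gear per car
    annual_costs=3_000,     # internet + servers + driver bonus
    driver_gmv=60_000,      # average GMV per driver per year
    commission=0.30,        # service commission
    gmv_lift_per_pct=0.04,  # +4% GMV per 1% fewer negative reviews
))  # ≈ 4.44, i.e. about 4.5%
```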
When we deployed the first version of our driver distraction detection system, we only managed to obtain a 1% reduction in negative reviews. After reviewing the system reports and feedback from the drivers, we noticed that the ML model was missing many instances of distraction. (For example, one driver had attached their phone to their head with an elastic band. The system did not flag this as distraction, but the passenger left a negative review reporting that the driver was distracted.)
We gathered a new dataset from the misclassified instances and fine-tuned the model. After much tinkering, we reached a 3% reduction in negative reviews, still a far cry from the 4.5% goal. We were on the verge of abandoning the project but decided to give it another shot.
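As an aside, here is a minimal sketch of that hard-example collection step, assuming a `predict` function and a set of labeled samples; this is illustrative, not our actual pipeline.

```python
from typing import Callable, List, Tuple

def collect_hard_examples(
    predict: Callable[[object], int],
    samples: List[Tuple[object, int]],
) -> List[Tuple[object, int]]:
    """Return the (input, label) pairs the current model gets wrong;
    these get mixed into the fine-tuning dataset alongside the
    original training data so the easy cases aren't forgotten."""
    return [(x, y) for x, y in samples if predict(x) != y]
```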
We went back to the drawing board and looked at the data from a fresh angle. We asked ourselves: how were rides and distraction instances distributed across drivers? Some exploratory data analysis showed that the top 20% of drivers accounted for 80% of the rides as well as 80% of the negative reviews. They also had an average GMV of $100,000. The long tail of part-time drivers wasn’t delivering many rides and, accordingly, wasn’t getting many reviews.
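Here is a sketch of the kind of check that surfaced this split, assuming a per-driver summary table in pandas; the column names and toy numbers are hypothetical.

```python
import pandas as pd

def top_share(per_driver: pd.DataFrame, col: str, top_frac: float = 0.2) -> float:
    """Fraction of `col` (rides, negative reviews, ...) contributed by
    the top `top_frac` of drivers, ranked by that column."""
    ranked = per_driver.sort_values(col, ascending=False)
    top_n = max(1, int(len(ranked) * top_frac))
    return ranked[col].head(top_n).sum() / ranked[col].sum()

# Toy example: 5 drivers with heavily skewed activity.
toy = pd.DataFrame({"driver_id": range(5),
                    "rides": [800, 50, 50, 50, 50]})
print(top_share(toy, "rides"))  # 0.8 -> top 20% of drivers deliver 80% of rides
```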
From this, we concluded that focusing on the high-activity drivers would change the dynamics of the product. First, we would capture far more distraction instances with fewer deployments, slashing the total cost of running the system by 80%. Second, the unit economics would improve: in this new setting, we only needed to reduce negative reviews by about 2.7% to break even on the cost of deploying the system within a year:
(One-time cost + annual costs) / (driver GMV × commission × GMV increase per 1% reduction in negative reviews)
= 3,200 / (100,000 × 0.3 × 0.04) ≈ 2.7%
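Reusing the break-even sketch from earlier with the high-activity drivers’ GMV:

```python
print(breakeven_reduction_pct(
    one_time_cost=200,
    annual_costs=3_000,
    driver_gmv=100_000,     # average GMV of the top 20% of drivers
    commission=0.30,
    gmv_lift_per_pct=0.04,
))  # ≈ 2.67, i.e. about 2.7%
```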
Since our system was already reducing negative reviews by 3%, it would turn a profit within a year of deployment; it was economically viable.
The lesson is that as product managers, we need to step back and look at the problem, the data, and the stakeholders from different angles. Full knowledge of the product and the people it touches can surface solutions that classic ML knowledge won’t provide. As a side note, this is another example of the Pareto principle, or 80-20 rule, where roughly 80% of consequences come from 20% of causes (the “vital few”). This is not something you learn in machine learning courses, but it is essential in AI/ML product management.
Footnote: At first, we considered building a real-time computer vision system that would alert the driver whenever they were distracted. This turned out to be a poor design because drivers would ignore or disable the alarm. Instead, we built a system that captured images of drivers at regular intervals, classified them as distracted or not distracted, and provided each driver with regular reports on their alertness and distraction levels. We paired this with an incentive system (e.g., financial rewards or penalties) that encouraged drivers to keep their distractions low. This approach proved much more effective at changing driver behavior and promoting safe driving.
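For the curious, here is a highly simplified sketch of that reporting loop; the capture and classification functions and the sampling interval are hypothetical stand-ins, not the production system.

```python
import time

SAMPLE_INTERVAL_S = 60  # hypothetical capture cadence

def monitoring_loop(capture_image, classify_distracted, update_report):
    """Periodically sample the in-cab camera, classify each frame as
    distracted / not distracted, and accumulate counts that feed the
    driver's regular distraction report."""
    distracted, total = 0, 0
    while True:
        frame = capture_image()
        distracted += int(classify_distracted(frame))
        total += 1
        update_report(distracted, total)  # e.g. rolls up into a weekly summary
        time.sleep(SAMPLE_INTERVAL_S)
```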