Editor’s note: this is part of our investigation into analytic models and best practices for their selection, deployment, and evaluation.
We all know that a working predictive model is a powerful business weapon. By translating data into insights and subsequent actions, businesses can offer a better customer experience, retain more customers, and increase revenue. This is why companies are now allocating more resources to developing, or purchasing, machine learning solutions.
While expectations for predictive analytics are sky-high, implementing machine learning in a business is not necessarily a smooth path. Interestingly, the problem is often not the quality of the data or the algorithms. I have worked with a number of companies that collected plenty of data, ensured its quality, and used research-proven algorithms implemented by well-educated data scientists, and yet they failed to see beneficial outcomes. What went wrong? Doesn’t good data plus a good algorithm equal beneficial insights?
The short answer: evaluation. You cannot improve what you improperly measure. A misguided evaluation approach leads us to adopt ineffective machine learning solutions. I have observed a number of common mistakes in companies trying to evaluate their predictive models. In this series, I will present a variety of evaluation methods and solutions, with practical industry examples. Here, in this first piece, I’ll look at accuracy evaluation metrics and the confusion between a good prediction and a good nudge.
Good predictions? Good nudges.
Let’s take the retail industry as an example. Many retailers believe that if they can accurately predict their customers’ future purchasing preferences, they can increase sales. After all, everyone has heard the stories of Target identifying a pregnant teen and Netflix’s success with its recommendation system.
This seemingly flawless assumption that accurate predictions increase sales is easily overturned, however, once we put accuracy evaluation metrics into practice. In machine learning, a metric measures how well a predictive model performs, usually according to some predefined scoring rule, so that different models can be compared. For instance, in recommender systems research, metrics such as recall/precision, Root Mean Square Error (RMSE), and Mean Average Precision (MAP) are frequently used to evaluate how “good” a model is. Roughly speaking, these metrics assume that a good model is one that accurately predicts which products a customer will purchase or rate most highly.
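To make this concrete, here is a minimal sketch of how two of these accuracy metrics, precision/recall at k and RMSE, are typically computed. The product names, ratings, and purchases are hypothetical, and this is only an illustration of the scoring rules, not any particular retailer’s evaluation pipeline.

```python
import numpy as np

def precision_recall_at_k(recommended, purchased, k=5):
    """Precision@k and recall@k for a single customer."""
    top_k = set(recommended[:k])
    hits = len(top_k & set(purchased))
    precision = hits / k
    recall = hits / len(purchased) if purchased else 0.0
    return precision, recall

def rmse(predicted_ratings, actual_ratings):
    """Root Mean Square Error between predicted and observed ratings."""
    predicted = np.asarray(predicted_ratings, dtype=float)
    actual = np.asarray(actual_ratings, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

# Hypothetical example: the model ranked these products for one customer...
recommended = ["cereal", "milk", "bread", "eggs", "coffee"]
# ...and the customer actually bought these.
purchased = ["cereal", "milk"]

print(precision_recall_at_k(recommended, purchased, k=5))  # (0.4, 1.0)
print(rmse([4.5, 3.0, 2.0], [5.0, 3.0, 1.0]))
```

By these scores, a model that correctly guesses what the customer was going to buy anyway looks excellent, which is exactly where the trouble starts.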
What’s the problem then? Let’s say I want to buy cereal and milk on Google Shopping Express. Once I open the app, let’s assume it accurately predicts that I will buy cereal and milk and shows them to me. I click and buy them. That’s great, right? But wait, the retailer originally expected the predictive model to increase sales. In this case, I was going to buy cereal and milk anyway, regardless of the accuracy of the prediction. Although my customer experience is probably improved, I do not necessarily buy more stuff. If the aim is to increase sales, the metric should, for example, focus on how well the model can predict and recommend products that will nudge me to buy much more than just cereal and milk.
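If the business goal really is incremental sales, a more suitable evaluation compares what customers spend when they see the model’s recommendations against a holdout group that does not. Below is a minimal sketch of that kind of lift calculation; the basket values and the simple A/B split are hypothetical.

```python
import numpy as np

def revenue_lift(treated_revenue, holdout_revenue):
    """Average revenue lift of customers shown recommendations vs. a holdout group."""
    treated_mean = float(np.mean(treated_revenue))
    holdout_mean = float(np.mean(holdout_revenue))
    return treated_mean - holdout_mean, treated_mean / holdout_mean - 1.0

# Hypothetical basket values (in dollars) from a simple A/B split.
treated = [23.10, 18.40, 31.75, 22.00, 27.30]   # saw recommendations
holdout = [21.90, 19.20, 24.10, 20.50, 22.80]   # did not

absolute, relative = revenue_lift(treated, holdout)
print(f"absolute lift: ${absolute:.2f} per customer, relative lift: {relative:.1%}")
```

A model can top the accuracy leaderboard from the previous sketch and still show no lift here; the lift number, not the accuracy score, answers the retailer’s original question about selling more.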
Oftentimes, the true objective is to nudge customers toward some choice or action.
Researchers and businesses have a vested interest in nudging. For instance, Lowe’s grocery store in El Paso successfully conducted an experiment that nudged customers to buy more produce. How? Simply by adding huge green arrows on the floor that pointed toward the produce aisle. Another successful example of nudging is “checkout charity”: retailers raise millions of dollars for charity by asking customers for small donations at the checkout screens.
Applying the predictive power of machine learning to nudging, if done right, can be extremely valuable. Many bright statistical minds, however, are confused by the subtle difference between a good prediction and a good nudge. If we mix up the two in the evaluation process, we may end up choosing a model that does not help us achieve our goal. For example, Facebook’s emotional contagion experiment, despite its controversy, not only shows how data can influence users’ emotions but also gives us a vivid example of a case where the goal of a metric should be to measure influence (or nudge) rather than prediction accuracy.
The best metric is one that is consistent with your business goal. Oftentimes, the true commercial objective is to nudge customers toward some choice or action. Perhaps it is time for data practitioners to pay more attention to metrics that reflect this kind of psychological impact of machine learning; sometimes the most effective result requires more than just good prediction.