Failing machine learning (ML) projects in 2020 like it’s the mid-2000s
Blog: Capgemini CTO Blog
While our algorithms keep getting stronger and Artificial Intelligence (AI) is recognized as a transformative technology, we still see ML projects fail because of preventable, well-known, and easily understood mistakes. In this blog post, we will revisit a classic review paper by John Elder from 2005, “Top 10 Data Mining Mistakes”, and see how even now, 15 years later, one might fail in the same avoidable manner.
Mistake 0: Lack of data. The explosion of data volume has not translated to an explosion of quality data or even relevant data. As explored in other posts, data quantity is not synonymous with data quality; in many cases more data means lower average quality, and substantial work is required to generate appropriate data. We have to focus on getting relevant data before modelling.
Mistake 1: Focus on training. Early practitioners often reported goodness-of-fit metrics using in-sample data, i.e. data that an algorithm has seen, has trained upon, and knows how to predict. It does not matter if we have a deep neural network with drop-out, regularisation, convolutional and pooling layers that uses the latest activation functions to stop vanishing gradients. We will still overfit our training set like a standard logistic regression (and potentially worse) if we do not have an appropriate validation scheme. We have to use a clear cross-validation framework that is fit for purpose.
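To make the gap between in-sample and out-of-sample performance concrete, here is a minimal sketch using only the Python standard library. The synthetic data, the 30% label noise, the 1-nearest-neighbour "memoriser", and all function names are our own assumptions for illustration, not something taken from Elder's paper.

```python
import random

def one_nn_predict(train, x):
    """Return the label of the training point closest to x (1-nearest neighbour)."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def accuracy(train, test):
    """Fraction of test points whose label the 1-NN model gets right."""
    hits = sum(one_nn_predict(train, x) == y for x, y in test)
    return hits / len(test)

def k_fold_accuracy(data, k=5):
    """Average accuracy over k folds; each fold is scored by a model that never saw it."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        scores.append(accuracy(train, test))
    return sum(scores) / k

rng = random.Random(0)
# Synthetic data (an assumption): the label is 1 when x > 0.5,
# but 30% of the labels are flipped to simulate noise.
data = []
for _ in range(200):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.3:
        y = 1 - y
    data.append((x, y))

in_sample = accuracy(data, data)             # the model is scored on data it memorised
out_of_sample = k_fold_accuracy(data, k=5)   # the model is scored on held-out folds

print(f"in-sample accuracy:     {in_sample:.2f}")
print(f"out-of-sample accuracy: {out_of_sample:.2f}")
```

The memoriser looks perfect on the data it was trained on, because every point is its own nearest neighbour, while the cross-validated score reflects how much of the signal it has actually learned.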
Mistake 2: Rely on one technique. While our favourite technique is great, it may not be the most appropriate for every problem we come across. Being able to employ a variety of modelling algorithms gives us two immediate gains: first, we can readily try different algorithms and potentially solve our problem faster and more easily. Second, we have a realistic baseline for our model’s performance, and we are not misled regarding our model’s success. We have to use an analytics pipeline that makes it straightforward to try different algorithms. As the aptly-named Nature 2019 paper “One neuron versus deep learning in aftershock prediction” by Mignan & Broccardo suggests, a newer technique is not always the best.
Mistake 3: Ask the wrong question. A classification problem is usually not adequately reflected by a precision/recall pair of values but rather by our expected losses. Missing a high-value customer costs more than missing a low-value customer; performing an invasive and dangerous surgical procedure on a healthy individual is more costly than not immediately recognising the need to operate on a sick individual. We need to tailor our model to our task and not our task to our model. This often takes substantial involvement from expert consultants who have deep expertise in particular industries or areas.
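One way to read "expected losses" is as a decision rule: act when the expected cost of acting is lower than the expected cost of not acting. The sketch below assumes a binary decision, a calibrated predicted probability, and zero cost for correct decisions; the cost values and function names are hypothetical.

```python
def expected_cost(p_positive, act, cost_fp, cost_fn):
    """Expected cost of acting (True) or not acting (False), given P(positive).

    Assumption: correct decisions cost nothing.
    """
    if act:
        return (1 - p_positive) * cost_fp   # we acted on a case that was negative
    return p_positive * cost_fn             # we ignored a case that was positive

def decide(p_positive, cost_fp, cost_fn):
    """Act only when acting has the lower expected cost."""
    return (expected_cost(p_positive, True, cost_fp, cost_fn)
            < expected_cost(p_positive, False, cost_fp, cost_fn))

def cost_threshold(cost_fp, cost_fn):
    """Probability above which acting is cheaper; follows from equating the two costs."""
    return cost_fp / (cost_fp + cost_fn)

# Hypothetical costs: a missed positive is nine times as expensive as a false alarm.
COST_FP, COST_FN = 1.0, 9.0

threshold = cost_threshold(COST_FP, COST_FN)
act_at_20_percent = decide(0.20, COST_FP, COST_FN)
act_at_5_percent = decide(0.05, COST_FP, COST_FN)

print(threshold, act_at_20_percent, act_at_5_percent)
```

With these assumed costs the break-even probability is 0.1 rather than the default 0.5, which is why a single precision/recall pair at a fixed threshold can be misleading.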
Mistake 4: Listen (only) to the data. We need to be data-driven but we also need to remember that expert knowledge matters. Ignoring prior work in a field can lead to duplication of effort (at best), naïve mistakes, or outright nonsensical findings (at worst). Especially when operating in highly regulated consumer environments or when modelling physical processes, we need to respect the rationale of the work alongside the empirical data. For example, physical laws (e.g. signal strength decreases with distance) have to be respected even if they lead to lower AUC-ROC in our classification tasks. As Twyman’s law states: “Any figure that looks interesting or different is usually wrong”, and as such we have to be self-critical of our data-driven insights.
Mistake 5: Accept leaks from the future. This is closely related to mistake 1. Simply put, it is easier to predict past events using a known future outcome. Knowing a team won the Premier League means they outscored most of their opponents over the season. This is temporal information leakage. It is easy to use seemingly innocuous information only to realise that at prediction/deployment time our model does not have the same information available. We should actively combat data leakage.
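A simple guard against temporal leakage is to split by time rather than at random, so that the model is never trained on anything that happened after the events it is evaluated on. The sketch below is illustrative only: the match records, field names, and cutoff date are hypothetical.

```python
from datetime import date

# Hypothetical match records; the field names are illustrative only.
matches = [
    {"date": date(2019, 8, 10), "home_goals": 2, "away_goals": 1},
    {"date": date(2019, 11, 2), "home_goals": 0, "away_goals": 0},
    {"date": date(2020, 1, 18), "home_goals": 1, "away_goals": 3},
    {"date": date(2020, 2, 22), "home_goals": 4, "away_goals": 0},
]

def time_based_split(records, cutoff):
    """Train on records strictly before the cutoff, evaluate on the rest."""
    train = [r for r in records if r["date"] < cutoff]
    test = [r for r in records if r["date"] >= cutoff]
    return train, test

train, test = time_based_split(matches, cutoff=date(2020, 1, 1))

# Sanity check: nothing in the training set postdates the evaluation set.
latest_train = max(r["date"] for r in train)
earliest_test = min(r["date"] for r in test)
print(latest_train, earliest_test)
```

The same discipline applies to features: any aggregate (a season total, a final league position) should be computed only from records dated before the prediction time.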
Mistake 6: Discount pesky cases. There is a big difference between removing corrupted data and removing inconvenient data. A human with a weight of 775 kilograms is very much a corrupted data point, but a 135-kilogram human is just an outlier. We need to recognise how to deal with these values in a coherent way. Data quality assurance, robust statistical methods, and appropriate loss functions are the way to deal with difficult data points.
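The distinction can be sketched in a few lines: drop only what a documented data-quality rule marks as impossible, keep genuine outliers, and summarise with a statistic that is not dominated by them. The weights and the plausibility bound below are hypothetical values chosen for illustration.

```python
from statistics import mean, median

# Hypothetical sample of body weights in kilograms.
weights_kg = [62, 70, 75, 81, 68, 90, 135, 775]

# Hypothetical data-quality rule: values above this bound are treated as corrupted.
MAX_PLAUSIBLE_KG = 650

# Remove corrupted values only; the 135 kg outlier is real data and stays.
cleaned = [w for w in weights_kg if w <= MAX_PLAUSIBLE_KG]

raw_mean = mean(weights_kg)          # pulled far upwards by the corrupted value
cleaned_mean = mean(cleaned)         # still influenced by the genuine outlier
cleaned_median = median(cleaned)     # robust summary of the typical weight

print(raw_mean, cleaned_mean, cleaned_median)
```

The median barely moves when the outlier is present, which is the kind of behaviour we want from a robust method; the same idea carries over to robust loss functions when fitting a model.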
Mistake 7: Extrapolate. Niels Bohr is often credited (though the attribution is disputed) with saying, “Prediction is very difficult, especially if it’s about the future!” Extrapolation is hard because it requires our model to predict outside the sample space it has been trained upon. Extrapolation is possible, but it is a different task from “connecting the dots” or finding similarities between known data samples and new query points. It is a common mistake to use performance validation procedures that are inadequate for the complexity of our extrapolation task. We should extrapolate consciously.
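A toy example shows why validation inside the training range says little about extrapolation. Here we fit a straight line by ordinary least squares to a quadratic relationship observed only on the interval from 0 to 1; the choice of function, range, and query point is an assumption made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x

def true_fn(x):
    """The (assumed) real relationship, unknown to the model."""
    return x ** 2

# Training data covers only the interval [0, 1].
xs = [i / 10 for i in range(11)]
ys = [true_fn(x) for x in xs]
slope, intercept = fit_line(xs, ys)

def predict(x):
    return slope * x + intercept

# Worst error inside the training range versus error far outside it.
max_in_range_error = max(abs(predict(x) - true_fn(x)) for x in xs)
extrapolation_error = abs(predict(5.0) - true_fn(5.0))

print(max_in_range_error, extrapolation_error)
```

The line looks like an acceptable approximation everywhere we have data and is badly wrong at x = 5, so a validation procedure that only samples from the training range would never reveal the problem.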
Mistake 8: Answer every inquiry. In close association with the above, one model cannot answer every question; no model has been optimized for such a universal criterion. It is easy to believe that a single high-performing model is a panacea for all business problems. Different criteria address different aspects of the same problem. Predicting high blood pressure does not predict cardiovascular diseases in the same population (though it can suggest an interesting hierarchical structure). Answer one question and answer it well; not all models are made for transfer learning.
Mistake 9: Sample casually. Huff’s classic 1954 book “How to Lie with Statistics” opens with a first chapter titled “The Sample with the Built-in Bias”. Do we want to lie with statistics? We don’t need to misplace a decimal point or change a 9 to a 1; we just need to sample a population that fits our story. Sampling bias is probably the single most common reason for a well-crafted ML model to output misleading results. Our samples need to be representative and accurate. We need to understand our sample creation process and the unintentional favouritism it might reflect.
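A small simulation illustrates how a sample can carry a built-in bias without any number being falsified. The scenario is hypothetical: a satisfaction survey where satisfied customers are assumed to respond four times as often as unsatisfied ones. The response rates, population size, and variable names are all our own assumptions.

```python
import random

rng = random.Random(42)

# Hypothetical population: roughly half of the customers are satisfied.
population = [rng.random() < 0.5 for _ in range(10_000)]
true_rate = sum(population) / len(population)

# Representative sample: every customer is equally likely to be included.
random_sample = rng.sample(population, 1_000)
random_estimate = sum(random_sample) / len(random_sample)

# Biased sample: satisfied customers answer the survey with probability 0.8,
# unsatisfied customers with probability 0.2 (assumed response rates).
def responds(satisfied):
    return rng.random() < (0.8 if satisfied else 0.2)

survey_sample = [s for s in population if responds(s)]
biased_estimate = sum(survey_sample) / len(survey_sample)

print(f"true rate:        {true_rate:.2f}")
print(f"random sample:    {random_estimate:.2f}")
print(f"survey responses: {biased_estimate:.2f}")
```

The survey sample is several times larger than the random one and still gives the worse estimate, because its size does nothing to correct for how it was collected.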
Mistake 10: Believe in the best model. Our best model is our best approximation to reality and not reality itself. In the past, especially when working with tree models, we have found that “two recursive partitioning algorithms (i.e. trees) can achieve the same prediction accuracy but, at the same time, represent structurally different regression relationships” [Hothorn et al. 2006]. This means that different but “equally good” models may lead to different conclusions about the influence of certain features in our modelling approach. We need to be careful to distinguish how a problem really works from how our model appears to work.
To conclude, it is easy to be lulled into a false sense of security about an ML model’s successful application. This has been the case in the past and it often remains true now. If you would like an informed and experienced approach to your analysis, please connect with me here.