The purpose of designing and training algorithms is to set them loose in the real world, where we expect them to perform as well as they did on our carefully curated training data set. But as Mike Tyson put it, “everyone has a plan, until they get punched in the face.” And in this case, your algorithm’s meticulously optimized performance may get punched in the face by a piece of data completely outside the scope of anything it encountered previously.
When does this become a problem? To understand, we need to return to the basic concepts of interpolation vs. extrapolation. Interpolation estimates a value within a known range of values. Extrapolation estimates a value beyond that range. If you’re a parent, you can probably recall your young child calling any small four-legged animal a cat, as their first classifier used only minimal features. Once they were taught to extrapolate and factor in additional features, they were able to correctly identify dogs too. Extrapolation is difficult, even for humans. Our models, smart as they might be, are interpolation machines. When you set them to an extrapolation task beyond the boundaries of their training data, even the most complex neural nets may fail.
What we need to adopt is data validation, and the problem it solves is not unique to machine learning. Google engineers published their method of data validation in 2019 after running into a production bug. In a nutshell, every batch of incoming data is examined for anomalies, some of which can only be detected by comparing training and production data. Implementing a data validation pipeline had several positive outcomes. One example the authors present in the paper is the discovery of missing features within the Google Play store recommendation algorithm. When the bug was fixed, app install rates increased by 2 percent.
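To make the idea of comparing training and production data concrete, here is a minimal sketch of one such check: flagging features that go missing far more often in a serving batch than they did in training. This is not the method from the Google paper; the function names, the row-as-dictionary layout, and the 0.1 tolerance are all assumptions chosen for illustration.

```python
from typing import Dict, List, Optional

# Assumption: each row is a dict mapping feature name -> value (None if absent).
Row = Dict[str, Optional[float]]


def missing_rates(rows: List[Row], features: List[str]) -> Dict[str, float]:
    """Return, per feature, the fraction of rows where it is absent or None."""
    if not rows:
        raise ValueError("cannot compute missing rates for an empty batch")
    total = len(rows)
    return {
        name: sum(1 for row in rows if row.get(name) is None) / total
        for name in features
    }


def find_missing_feature_anomalies(
    training_rows: List[Row],
    serving_rows: List[Row],
    tolerance: float = 0.1,  # hypothetical threshold, tune per feature in practice
) -> List[str]:
    """List features whose missing rate in serving exceeds training by > tolerance."""
    features = sorted({name for row in training_rows for name in row})
    train = missing_rates(training_rows, features)
    serve = missing_rates(serving_rows, features)
    return [name for name in features if serve[name] - train[name] > tolerance]


training = [{"installs": 10.0, "rating": 4.5}, {"installs": 3.0, "rating": 3.9}]
serving = [{"installs": 7.0}, {"installs": 2.0, "rating": None}]
print(find_missing_feature_anomalies(training, serving))
```

A check like this only sees the problem because it has the training data as a baseline. Looking at the serving batch alone, a consistently absent feature can look perfectly normal.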
Clearly this is a problem for mission-critical algorithms. Machine learning models in healthcare bear a responsibility to return the best possible results to patients, as do the clinicians evaluating their output. In such scenarios, a zero-tolerance approach to out-of-bounds data may be more appropriate. In essence, the algorithm should recognize an anomaly in the input data and return a null result. Given the tremendous variation in human health, along with possible coding and pipeline errors, we shouldn’t allow our models to extrapolate just yet.
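As a rough sketch of what zero tolerance could look like in code, the wrapper below refuses to call the model at all when any input is missing or outside its allowed bounds, and returns None instead. The names guarded_predict and bounds, and the idea of the model as a plain callable, are assumptions for illustration rather than a description of any particular production system.

```python
from typing import Callable, Dict, Optional, Tuple

# Assumption: bounds maps each feature name to its allowed (low, high) interval.
Bounds = Dict[str, Tuple[float, float]]
Features = Dict[str, float]


def guarded_predict(
    model: Callable[[Features], float],
    features: Features,
    bounds: Bounds,
) -> Optional[float]:
    """Return the model output, or None if any input is missing or out of bounds."""
    for name, (low, high) in bounds.items():
        value = features.get(name)
        # NaN fails both comparisons, so it is rejected here as well.
        if value is None or not (low <= value <= high):
            return None
    return model(features)


def toy_model(features: Features) -> float:
    """Stand-in for a trained model, used only to exercise the wrapper."""
    return 0.01 * features["age"]


limits = {"age": (18.0, 90.0)}
print(guarded_predict(toy_model, {"age": 40.0}, limits))
print(guarded_predict(toy_model, {"age": 120.0}, limits))
```

Returning None forces the calling code, and ultimately the clinician, to handle the "no answer" case explicitly instead of receiving an extrapolated number that looks as trustworthy as any other.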
I’m the CTO at a health tech company, and we combine these approaches: We conduct a number of robustness tests on every model to determine whether model output changes with variation in the features of our training sets. This training step allows us to learn the model’s limitations across multiple dimensions, and it also uses explainable AI models for scientific validation. But we also set out-of-bounds limits on our models to ensure patients are protected.
If there’s one takeaway here, it’s that you need to implement feature validation for your deployed algorithms. Every feature is ultimately a number, and the range of numbers encountered during training is known. At minimum, adding a validation step that checks whether each feature value in a given run falls within its training range will increase model quality.
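The minimum version of that validation step might look like the sketch below: record the smallest and largest value seen for each feature during training, then report which features in a new input fall outside those ranges. The function names and the sample features (age, bmi) are hypothetical, and real pipelines would likely want to handle categorical features and outliers in the training data separately.

```python
from typing import Dict, List, Tuple

Ranges = Dict[str, Tuple[float, float]]


def fit_ranges(training_rows: List[Dict[str, float]]) -> Ranges:
    """Record the (min, max) value observed for each feature in training."""
    ranges: Ranges = {}
    for row in training_rows:
        for name, value in row.items():
            low, high = ranges.get(name, (value, value))
            ranges[name] = (min(low, value), max(high, value))
    return ranges


def out_of_range_features(row: Dict[str, float], ranges: Ranges) -> List[str]:
    """List features that are missing from the row or outside the training range."""
    return [
        name
        for name, (low, high) in ranges.items()
        if name not in row or not (low <= row[name] <= high)
    ]


training = [{"age": 30.0, "bmi": 22.0}, {"age": 65.0, "bmi": 31.5}]
ranges = fit_ranges(training)

print(out_of_range_features({"age": 40.0, "bmi": 25.0}, ranges))  # prints []
print(out_of_range_features({"age": 40.0, "bmi": 45.0}, ranges))  # prints ['bmi']
```

The ranges are computed once at training time and stored alongside the model, so the check at prediction time costs only a couple of comparisons per feature.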
© 2021 LeackStat.com