The field of AI and machine learning is arguably built on the shoulders of a few hundred papers, many of which draw conclusions using data from a subset of public datasets. Large, labeled corpora have been critical to the success of AI in domains ranging from image classification to audio classification. That’s because their annotations expose comprehensible patterns to machine learning algorithms, in effect telling machines what to look for in future datasets so they’re able to make predictions.
But while labeled data is usually equated with ground truth, datasets can — and do — contain errors. The processes used to construct corpora often involve some degree of automatic annotation or crowdsourcing techniques that are inherently error-prone. This becomes especially problematic when these errors reach test sets, the subsets of datasets researchers use to compare progress and validate their findings. Labeling errors here could lead scientists to draw incorrect conclusions about which models perform best in the real world, potentially undermining the framework by which the community benchmarks machine learning systems.
Above: A chart showing the percentage of labeling errors in popular AI benchmark datasets.
In choosing which datasets to audit, the researchers looked at the most-used open source datasets created in the last 20 years, with a preference for diversity across computer vision, natural language processing, sentiment analysis, and audio modalities. In total, they evaluated six image datasets (MNIST, CIFAR-10, CIFAR-100, Caltech-256, ImageNet, and QuickDraw), three text datasets (20news, IMDB, and Amazon Reviews), and one audio dataset (AudioSet).
Errors included:
A previous study out of MIT found that ImageNet has “systematic annotation issues” and is misaligned with ground truth or direct observation when used as a benchmark dataset. The coauthors of that research concluded that about 20% of ImageNet photos contain multiple objects, leading to a drop in accuracy as high as 10% among models trained on the dataset.
In an experiment, the researchers filtered out the erroneous labels in ImageNet and benchmarked a number of models on the corrected set. The results were largely unchanged, but when the models were evaluated only on the erroneous data, those that performed best on the original, incorrect labels were found to perform the worst on the correct labels. The implication is that the models learned to capture systematic patterns of label error in order to improve their original test accuracy.
Above: A Chihuahua mislabeled as a feather boa in ImageNet.
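The subset evaluation described above can be illustrated with a minimal Python sketch. It is not the researchers' code: the function names, the two models, and the toy labels are all made up for illustration, and it assumes corrected labels are available for every test example. It scores each model only on the examples whose label changed, once against the original labels and once against the corrected ones.

```python
from typing import Sequence, Tuple


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the given labels."""
    if len(preds) != len(labels):
        raise ValueError("preds and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def erroneous_subset_accuracy(
    preds: Sequence[int],
    original: Sequence[int],
    corrected: Sequence[int],
) -> Tuple[float, float]:
    """Score predictions only on examples whose label was changed.

    Returns (accuracy vs. original labels, accuracy vs. corrected labels).
    """
    # Indices where the original label disagrees with the corrected one.
    idx = [i for i, (o, c) in enumerate(zip(original, corrected)) if o != c]
    sub_preds = [preds[i] for i in idx]
    return (
        accuracy(sub_preds, [original[i] for i in idx]),
        accuracy(sub_preds, [corrected[i] for i in idx]),
    )


# Toy data (hypothetical): 6 test examples, 4 of which were relabeled.
original = [0, 1, 2, 1, 0, 2]
corrected = [0, 2, 1, 0, 1, 2]

# Hypothetical model A mostly reproduces the original (wrong) labels.
model_a = [0, 1, 2, 1, 1, 2]
# Hypothetical model B mostly matches the corrected labels.
model_b = [0, 2, 1, 0, 0, 2]

acc_a = erroneous_subset_accuracy(model_a, original, corrected)
acc_b = erroneous_subset_accuracy(model_b, original, corrected)
print("model A (original, corrected):", acc_a)
print("model B (original, corrected):", acc_b)
```

In this toy case the model that looks best against the original labels looks worst against the corrected ones, which is the pattern the experiment reported on the erroneous portion of ImageNet.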
In a follow-up experiment, the coauthors created an error-free CIFAR-10 test set to measure models’ “corrected” accuracy. The results show that powerful models didn’t reliably perform better than their simpler counterparts, because relative performance was correlated with the degree of labeling errors. For datasets where errors are common, data scientists might be misled into selecting a model that isn’t actually the best in terms of corrected accuracy, the study’s coauthors say.
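To make the model-selection risk concrete, here is a small Python sketch, again under stated assumptions: the model names ("complex" and "simple"), the predictions, and the labels are invented, and the ranking helper is hypothetical rather than anything from the study. It ranks the same two models twice, once by accuracy on the original test labels and once by accuracy on corrected labels.

```python
from typing import Dict, List, Sequence


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the given labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def rank_models(
    predictions: Dict[str, Sequence[int]], labels: Sequence[int]
) -> List[str]:
    """Return model names sorted from highest to lowest accuracy."""
    return sorted(
        predictions,
        key=lambda name: accuracy(predictions[name], labels),
        reverse=True,
    )


# Toy test set (hypothetical): labels at positions 2 and 5 were corrected.
original = [0, 1, 1, 0, 1, 0, 1, 0]
corrected = [0, 1, 0, 0, 1, 1, 1, 0]

predictions = {
    # Agrees with the original labels, including the two erroneous ones.
    "complex": [0, 1, 1, 0, 1, 0, 1, 1],
    # Gets the two relabeled examples right but misses one clean example.
    "simple": [0, 1, 0, 0, 1, 1, 0, 0],
}

by_original = rank_models(predictions, original)
by_corrected = rank_models(predictions, corrected)
print("ranked by original labels: ", by_original)
print("ranked by corrected labels:", by_corrected)
```

With these made-up numbers the ranking flips once the labels are corrected, so a data scientist choosing by original test accuracy alone would pick the model with the lower corrected accuracy.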
To promote more accurate benchmarks, the researchers have released a cleaned version of each test set in which a large portion of the label errors have been corrected. The team recommends that data scientists measure the real-world accuracy they care about in practice and consider using simpler models for datasets with error-prone labels, especially for algorithms trained or evaluated with noisy labeled data.
Creating datasets in a privacy-preserving, ethical way remains a major blocker for researchers in the AI community, particularly those who specialize in computer vision. In January 2019, IBM released a corpus designed to mitigate bias in facial recognition algorithms that contained nearly a million photos of people from Flickr. But IBM failed to notify either the photographers or the subjects of the photos that the images would be included in the corpus. Separately, an earlier version of ImageNet, a dataset used to train AI systems around the world, was found to contain photos of naked children, porn actresses, college parties, and more, all scraped from the web without those individuals’ consent.
In July 2020, the creators of the 80 Million Tiny Images dataset from MIT and NYU took the collection offline, apologized, and asked other researchers to refrain from using the dataset and to delete any existing copies. Introduced in 2006 and containing photos scraped from internet search engines, 80 Million Tiny Images was found to have a range of racist, sexist, and otherwise offensive annotations, such as nearly 2,000 images labeled with the N-word, and labels like “rape suspect” and “child molester.” The dataset also contained pornographic content like nonconsensual photos taken up women’s skirts.
Some in the AI community are taking steps to build less problematic corpora. The ImageNet creators said they plan to remove virtually all of about 2,800 categories in the “person” subtree of the dataset, which were found to poorly represent people from the Global South. And this week, the group released a version of the dataset that blurs people’s faces in order to support privacy experimentation.
© 2021 LeackStat.com