Label errors abound in the most common AI test sets
A new study from the Massachusetts Institute of Technology found label errors in ten of the most-cited artificial intelligence test datasets. Researchers estimated an average error rate of 3.4% across the datasets, cautioning that such errors could destabilize machine learning benchmarks.
"Researchers rely on benchmark test datasets to evaluate and measure progress in the state-of-the-art and to validate theoretical findings," wrote the study authors.
"If label errors occurred profusely, they could potentially undermine the framework by which we measure progress in machine learning," they continued.
WHY IT MATTERS
As MIT Technology Review senior AI reporter Karen Hao noted in a write-up about the study, researchers use a core set of datasets to evaluate ML models and track AI capability over time.
There are known issues with many of these sets, Hao wrote, including racist and sexist labels. However, the new study finds that many of the labels are simply wrong as well.
For instance, researchers found that a photo of a frog in CIFAR-10, a visual dataset, was erroneously labeled as a cat. In the commonly used ImageNet validation set, a lion was labeled as a patas monkey, a dock was labeled as a paper towel, and giant pandas were repeatedly labeled as red pandas.
And in QuickDraw, a collection of 50 million drawings across 345 categories, an eye was labeled as a tiger, a lightbulb was labeled as a tiger, and an apple was labeled as a t-shirt.
In total, researchers found 2,916 label errors in the ImageNet validation set, or about 6% of its labels, and estimated an error rate of over 10% in QuickDraw.
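The ImageNet percentage can be checked with a quick calculation. The sketch below assumes the validation set holds 50,000 images, a figure that is not stated in this article; the variable names are illustrative only.

```python
# Reported above: 2,916 label errors found in the ImageNet validation set.
errors_found = 2916

# Assumption (not stated in the article): the validation set holds 50,000 images.
validation_set_size = 50_000

# Share of validation labels found to be wrong.
error_rate = errors_found / validation_set_size

print(f"{error_rate:.1%}")  # prints 5.8%, which rounds to the roughly 6% cited
```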
"Traditionally, ML practitioners choose which model to deploy based on test accuracy – our findings advise caution here, proposing that judging models over correctly labeled test sets may be more useful, especially for noisy real-world datasets," read the study. Researchers noted that once the labels were corrected, some of the models that had scored lower on the original, incorrect labels ranked among the best performers.
"In particular, the simpler models seemed to fare better on the corrected data than the more complicated models that are used by tech giants like Google for image recognition and assumed to be the best in the field," wrote Hao.
"In other words, we may have an inflated sense of how great these complicated models are because of flawed testing data."
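A toy example can illustrate how mislabeled test data might reorder a leaderboard. The sketch below is not the study's method or data: the ten-item test set, the two sets of predictions, and the names "complex" and "simple" are all hypothetical. It only shows that a model which reproduces label errors can look better on the original labels and worse once they are corrected.

```python
def accuracy(predictions, labels):
    """Return the fraction of predictions that match the given labels."""
    matches = sum(p == y for p, y in zip(predictions, labels))
    return matches / len(labels)


# Hypothetical test set: the original labels at indices 3 and 7 are wrong.
original_labels = ["cat", "dog", "frog", "cat", "dog",
                   "frog", "cat", "dog", "frog", "cat"]
corrected_labels = ["cat", "dog", "frog", "frog", "dog",
                    "frog", "cat", "cat", "frog", "cat"]

# Hypothetical predictions. "complex" reproduces both label errors;
# "simple" gets those two items right. Both miss the last item.
models = {
    "complex": ["cat", "dog", "frog", "cat", "dog",
                "frog", "cat", "dog", "frog", "dog"],
    "simple": ["cat", "dog", "frog", "frog", "dog",
               "frog", "cat", "cat", "frog", "dog"],
}


def ranking(labels):
    """Return model names ordered from highest to lowest accuracy."""
    return sorted(models, key=lambda name: accuracy(models[name], labels),
                  reverse=True)


for name, predictions in models.items():
    print(name,
          accuracy(predictions, original_labels),
          accuracy(predictions, corrected_labels))

print(ranking(original_labels))   # ['complex', 'simple']
print(ranking(corrected_labels))  # ['simple', 'complex']
```

In this made-up case, the "complex" model scores 0.9 on the original labels and 0.7 on the corrected ones, while the "simple" model does the reverse, so the ranking flips once the two bad labels are fixed.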
THE LARGER TREND
"From our experience, most healthcare organizations do not evaluate algorithms in the context of their intended use," said Dr. Sujay Kakarmath, a digital health scientist at Partners Healthcare, in an interview with Healthcare IT News in 2018.
"The technical performance of an algorithm for a given task is far from being the only metric that determines its potential impact," Kakarmath added.
ON THE RECORD
"Whereas train set labels in a small number of machine learning datasets, e.g. in the ImageNet dataset, are well-known to contain errors, labeled data in test sets is often considered 'correct' as long as it is drawn from the same distribution as the train set," wrote the research team in the study.
"This is a fallacy – machine learning test sets can, and do, contain pervasive errors and these errors can destabilize ML benchmarks," they added.