Major advancements in computer vision can primarily be attributed to the use of labeled datasets. However, acquiring labels for datasets often introduces errors that can harm model performance. Recent works have proposed methods to automatically identify mislabeled images, but strategies for applying these methods effectively to real-world datasets remain sparsely explored. Towards improved data-centric methods for cleaning real-world vision datasets, we first conduct more than 200 experiments carefully benchmarking recently developed automated mislabel detection methods on multiple datasets under a variety of synthetic and real noise settings with varying noise levels. We compare these methods to a Simple and Efficient Mislabel Detector (SEMD) that we craft, and find that SEMD performs similarly to or outperforms prior mislabel detection approaches. We then apply SEMD to multiple real-world computer vision datasets and test how dataset size, mislabel removal strategy, and mislabel removal amount further affect model performance after retraining on the cleaned data. With careful design of the approach, we find that mislabel removal leads to per-class performance improvements of up to 8% for a retrained classifier in smaller data regimes.