An Empirical Study of Automated Mislabel Detection in Real World Vision Datasets

Major advancements in computer vision can primarily be attributed to the use of labeled datasets. However, acquiring labels for datasets often results in errors which can harm model performance. Recent works have proposed methods to automatically identify mislabeled images, but strategies for applying them effectively to real world datasets have been sparsely explored. Towards improved data-centric methods for cleaning real world vision datasets, we first conduct more than 200 experiments carefully benchmarking recently developed automated mislabel detection methods on multiple datasets under a variety of synthetic and real noise settings with varying noise levels. We compare these methods to a Simple and Efficient Mislabel Detector (SEMD) that we craft, and find that SEMD performs similarly to or outperforms prior mislabel detection approaches. We then apply SEMD to multiple real world computer vision datasets and test how dataset size, mislabel removal strategy, and mislabel removal amount further affect model performance after retraining on the cleaned data. With careful design of the approach, we find that mislabel removal leads to per-class performance improvements of up to 8% for a retrained classifier in smaller data regimes.
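To make the benchmarking setup concrete, the following is a minimal sketch of one common way to inject synthetic label noise and then flag likely mislabels. It is not the SEMD method, whose details the abstract does not give. The function names `inject_symmetric_noise` and `flag_suspects` are hypothetical, and the sketch assumes that `probs` holds class probabilities predicted by some classifier on data it was not trained on.

```python
import numpy as np

def inject_symmetric_noise(labels, num_classes, noise_rate, seed=0):
    """Flip a fraction of labels to a different class chosen uniformly at random."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    noisy = labels.copy()
    n_flip = int(round(noise_rate * len(labels)))
    flip_idx = rng.choice(len(labels), size=n_flip, replace=False)
    # An offset in [1, num_classes - 1] guarantees the new label differs from the old one.
    offsets = rng.integers(1, num_classes, size=n_flip)
    noisy[flip_idx] = (labels[flip_idx] + offsets) % num_classes
    return noisy, flip_idx

def flag_suspects(probs, given_labels, removal_fraction):
    """Return indices of the samples whose given label has the lowest predicted probability."""
    probs = np.asarray(probs)
    given_labels = np.asarray(given_labels)
    # Score each sample by the probability assigned to its (possibly wrong) given label.
    scores = probs[np.arange(len(given_labels)), given_labels]
    n_remove = int(round(removal_fraction * len(given_labels)))
    return np.argsort(scores, kind="stable")[:n_remove]
```

In this sketch, `removal_fraction` plays the role of the "mislabel removal amount" mentioned above: the flagged indices would be dropped before retraining, and varying the fraction shows how aggressive cleaning trades off against lost training data.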

