Over-Fit: Noisy-Label Detection based on the Overfitted Model Property

The growing need to handle noisy labels in massive datasets has drawn considerable attention to learning with noisy labels in recent years. One promising line of work selects clean training data by finding small-loss instances before a deep neural network overfits the noisy-label data. Preventing overfitting, however, is challenging. In this paper, we propose a novel noisy-label detection algorithm that exploits the property of overfitting on individual data points. To this end, we present two novel criteria that statistically measure how abnormally each training sample affects the model and the clean validation data. Using these criteria, our iterative algorithm alternately removes noisy-label samples and retrains the model until no further performance improvement is made. In experiments on multiple benchmark datasets, we demonstrate the validity of our algorithm and show that it outperforms state-of-the-art methods when the exact noise rates are not given. Furthermore, we show that our method can not only be extended to a real-world video dataset but can also be viewed as a regularization method for solving problems caused by overfitting.
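The alternating remove-and-retrain loop described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not define the two criteria, so the sketch substitutes a stand-in score (the z-score of each sample's distance to the centroid of its labeled class) and a toy nearest-centroid classifier in place of a deep network. The names `fit_centroids`, `sample_scores`, `iterative_clean`, and `z_thresh` are hypothetical and do not come from the paper.

```python
import numpy as np

def fit_centroids(X, y, n_classes):
    # Toy "model": one mean vector per class (stand-in for a deep network).
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])

def distances(centroids, X):
    # Euclidean distance from every sample to every class centroid: (n, C).
    return np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)

def accuracy(centroids, X, y):
    return float(np.mean(distances(centroids, X).argmin(axis=1) == y))

def sample_scores(centroids, X, y):
    # Stand-in criterion (NOT the paper's): distance to the labeled class centroid.
    return distances(centroids, X)[np.arange(len(y)), y]

def iterative_clean(X, y, X_val, y_val, n_classes, z_thresh=1.5, max_rounds=10):
    keep = np.ones(len(y), dtype=bool)
    model = fit_centroids(X, y, n_classes)
    base_acc = best_acc = accuracy(model, X_val, y_val)
    for _ in range(max_rounds):
        scores = sample_scores(model, X[keep], y[keep])
        z = (scores - scores.mean()) / (scores.std() + 1e-12)
        flagged = z > z_thresh
        if not flagged.any():
            break
        candidate = keep.copy()
        candidate[np.flatnonzero(keep)[flagged]] = False
        # Do not let any class lose all of its training samples.
        if np.bincount(y[candidate], minlength=n_classes).min() == 0:
            break
        cand_model = fit_centroids(X[candidate], y[candidate], n_classes)
        acc = accuracy(cand_model, X_val, y_val)
        if acc <= best_acc:
            break  # no further improvement on clean validation data: stop
        keep, model, best_acc = candidate, cand_model, acc
    return keep, base_acc, best_acc

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
    y_clean = np.repeat([0, 1], 100)
    y_noisy = y_clean.copy()
    flip = rng.choice(200, size=40, replace=False)
    y_noisy[flip] = 1 - y_noisy[flip]  # synthetic 20% label noise
    X_val = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
    y_val = np.repeat([0, 1], 50)
    keep, base_acc, best_acc = iterative_clean(X, y_noisy, X_val, y_val, 2)
    print(f"kept {keep.sum()} of {len(keep)} samples; "
          f"validation accuracy {base_acc:.3f} -> {best_acc:.3f}")
```

A removal round is accepted only if accuracy on the clean validation set strictly improves, which mirrors the stopping rule stated in the abstract; the scoring function is the part a real implementation would replace with the paper's criteria.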



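Overfitting

In AI, overfitting usually refers to a learned model that is too complex: it performs well on the training set but poorly on the test set. Overfitting (also called over-learning) therefore shows up as poor generalization. It arises because the training data contain sampling error, and during parameter fitting a complex model fits that sampling error along with the underlying signal.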
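As a small illustration of this definition, the sketch below fits polynomials of increasing degree to a few noisy samples of a sine curve and compares training and test error. The target function, noise level, and degrees are arbitrary assumptions chosen for the demonstration, and `poly_mse` is a hypothetical helper name.

```python
import numpy as np

def poly_mse(degree, x_train, y_train, x_test, y_test):
    # Least-squares polynomial fit on the training points only.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_mse = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    return train_mse, test_mse

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_train = np.linspace(-1, 1, 12)
    x_test = np.linspace(-0.95, 0.95, 50)
    # Noisy observations of sin(pi * x); the noise plays the role of sampling error.
    y_train = np.sin(np.pi * x_train) + rng.normal(0, 0.3, x_train.size)
    y_test = np.sin(np.pi * x_test) + rng.normal(0, 0.3, x_test.size)
    for degree in (1, 3, 9):
        train_mse, test_mse = poly_mse(degree, x_train, y_train, x_test, y_test)
        print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

Training error can only decrease as the degree grows, because each higher-degree family contains the lower-degree ones; a widening gap between training and test error is the usual sign that the extra capacity is being spent on fitting noise.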
