Although deep learning has achieved great success, it often relies on large amounts of accurately labeled training data, which are expensive and time-consuming to collect. A prominent way to reduce this cost is to learn from noisy labels, which are ubiquitous in real-world applications. A critical challenge in this setting is to reduce the effect of network memorization of falsely labeled data. In this work, we propose an iterative selection approach based on the Weibull mixture model, which identifies clean data by considering the overall learning dynamics of each data instance. In contrast to previous small-loss heuristics, we leverage the observation that deep networks find clean data easy to memorize and hard to forget. In particular, we measure the difficulty of memorization and forgetting for each instance via the transition times between being misclassified and being memorized during training, and integrate them into a novel selection metric. Based on the proposed metric, we retain a subset of identified clean data and repeat the selection procedure to iteratively refine the clean subset, which is finally used for model training. To validate our method, we perform extensive experiments on synthetic noisy datasets and real-world web data, where our strategy outperforms existing noisy-label learning methods.
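As a rough illustration of the learning-dynamics signal described above, the following Python sketch counts, for a single instance, how often it moves from misclassified to memorized and back again over training. It assumes that "transition times" can be read as per-instance transition counts and that a per-epoch record of whether the prediction matched the given label is available. The function name `count_transitions` and the toy histories are hypothetical, and the sketch does not reproduce the proposed selection metric or the Weibull mixture fit.

```python
from typing import List, Tuple


def count_transitions(correct_per_epoch: List[bool]) -> Tuple[int, int]:
    """Return (memorization_events, forgetting_events) for one instance.

    correct_per_epoch[t] is True when the network's prediction at epoch t
    matches the given (possibly noisy) label.
    """
    memorized = 0
    forgotten = 0
    previous = False  # assumption: every instance starts as misclassified
    for current in correct_per_epoch:
        if current and not previous:
            memorized += 1  # misclassified -> memorized
        elif previous and not current:
            forgotten += 1  # memorized -> misclassified
        previous = current
    return memorized, forgotten


# Toy histories, invented for illustration (not data from the paper).
stable = [False, True, True, True, True, True]
unstable = [False, False, True, False, True, False]

stable_counts = count_transitions(stable)
unstable_counts = count_transitions(unstable)
print(stable_counts, unstable_counts)
```

Under this reading, an instance that is memorized early and never forgotten (like `stable`) would look more like clean data than one that flips repeatedly (like `unstable`). How such counts are combined into a single score and then modeled is left to the method itself.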