Although deep learning has achieved great success, it often relies on a large amount of training data with accurate labels, which is expensive and time-consuming to collect. A prominent way to reduce this cost is to learn with noisy labels, which are ubiquitous in real-world applications. A critical challenge in this setting is to limit the network's memorization of falsely labeled data. In this work, we propose an iterative selection approach based on the Weibull mixture model, which identifies clean data by considering the overall learning dynamics of each data instance. In contrast to previous small-loss heuristics, we leverage the observation that deep networks memorize clean data easily and forget it with difficulty. In particular, we measure how difficult each instance is to memorize and to forget via its transition times between being misclassified and being memorized during training, and integrate these measurements into a novel selection metric. Based on the proposed metric, we retain a subset of identified clean data and repeat the selection procedure to iteratively refine this subset, which is finally used for model training. To validate our method, we perform extensive experiments on synthetic noisy datasets and real-world web data, where our strategy outperforms existing noisy-label learning methods.
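To make the idea of selecting by learning dynamics concrete, the following is a minimal sketch and not the method proposed in this work. It assumes that a per-epoch record of whether each instance was classified in agreement with its given label is available. The functions `memorization_time`, `forgetting_events`, `selection_score`, and `select_clean` are hypothetical names, the additive score is an illustrative stand-in for the proposed metric, and a fixed keep fraction replaces the Weibull mixture model.

```python
from typing import List, Sequence


def memorization_time(history: Sequence[bool]) -> int:
    """Epoch index at which the instance is first classified correctly.

    Returns len(history) if it is never memorized (assumed convention).
    """
    for epoch, correct in enumerate(history):
        if correct:
            return epoch
    return len(history)


def forgetting_events(history: Sequence[bool]) -> int:
    """Number of transitions from memorized (True) to misclassified (False)."""
    return sum(1 for prev, cur in zip(history, history[1:]) if prev and not cur)


def selection_score(history: Sequence[bool]) -> int:
    """Hypothetical score: lower means learned early and rarely forgotten."""
    return memorization_time(history) + forgetting_events(history)


def select_clean(histories: Sequence[Sequence[bool]], keep_fraction: float) -> List[int]:
    """Return sorted indices of the lowest-scoring instances.

    A fixed keep fraction is an assumption made here for simplicity;
    it stands in for a fitted mixture model.
    """
    n_keep = max(1, int(len(histories) * keep_fraction))
    ranked = sorted(range(len(histories)), key=lambda i: selection_score(histories[i]))
    return sorted(ranked[:n_keep])


if __name__ == "__main__":
    # Toy records: one row per instance, one entry per epoch.
    histories = [
        [False, True, True, True],     # learned early, never forgotten
        [False, False, True, False],   # learned late, then forgotten
        [False, False, False, False],  # never memorized
        [True, True, True, True],      # memorized from the first epoch
    ]
    print(select_clean(histories, keep_fraction=0.5))  # prints [0, 3]
```

In an iterative scheme like the one described above, the retained indices would define the training subset for the next round, after which the records are collected again and the selection is repeated.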