Noisy label learning aims to train deep neural networks on a large number of samples with noisy labels; the main challenge is handling the inaccurate supervision caused by wrong labels. Existing works adopt either the label correction or the sample selection paradigm to bring more accurately labeled samples into the training process. In this paper, we propose a simple yet effective sample selection algorithm, termed Pairwise Similarity Distribution Clustering~(PSDC), which divides the training samples into a clean set and a noisy set; this partition can then drive any off-the-shelf semi-supervised learning regime to further train networks for different downstream tasks. Specifically, we use the pairwise similarity between samples to represent the sample structure, and a Gaussian Mixture Model~(GMM) to model the distribution of similarities between sample pairs belonging to the same noisy cluster, so that each sample can be confidently assigned to the clean set or the noisy set. Even under severe label noise, the resulting data partition mechanism is shown, both theoretically and empirically, to be more robust in judging label confidence. Experimental results on various benchmark datasets, such as CIFAR-10, CIFAR-100 and Clothing1M, demonstrate significant improvements over state-of-the-art methods.
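The clean/noisy partition described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how pairwise similarities are turned into a per-sample decision, so the sketch assumes cosine similarity, scores each sample by its mean similarity to the other samples sharing its noisy label, and fits a two-component one-dimensional GMM to those scores with a 0.5 posterior threshold. The function names (`cosine`, `fit_gmm_1d`, `split_clean_noisy`) and the toy features are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def fit_gmm_1d(xs, iters=50):
    """Fit a two-component 1-D Gaussian mixture with EM.

    Returns (means, variances, weights, resp), where resp[i][k] is the
    posterior probability that xs[i] belongs to component k.
    """
    lo, hi = min(xs), max(xs)
    means = [lo, hi]  # initialise the components at the two extremes
    spread = max((hi - lo) ** 2 / 4.0, 1e-6)
    variances = [spread, spread]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for x in xs:
            dens = [
                weights[k]
                * math.exp(-((x - means[k]) ** 2) / (2 * variances[k]))
                / math.sqrt(2 * math.pi * variances[k])
                for k in range(2)
            ]
            total = sum(dens)
            resp.append([d / total for d in dens] if total > 0 else [0.5, 0.5])
        # M-step: re-estimate weights, means and (floored) variances.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-12
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, 1e-6)
    return means, variances, weights, resp


def split_clean_noisy(features, noisy_labels, threshold=0.5):
    """Partition sample indices into (clean, noisy) lists, class by class."""
    clean, noisy = [], []
    for label in sorted(set(noisy_labels)):
        idx = [i for i, y in enumerate(noisy_labels) if y == label]
        if len(idx) < 3:
            noisy.extend(idx)  # assumption: too few samples to model reliably
            continue
        # Assumption: score = mean similarity to the rest of the noisy class.
        scores = [
            sum(cosine(features[i], features[j]) for j in idx if j != i)
            / (len(idx) - 1)
            for i in idx
        ]
        means, _, _, resp = fit_gmm_1d(scores)
        high = 0 if means[0] >= means[1] else 1  # high-similarity component
        for i, r in zip(idx, resp):
            (clean if r[high] > threshold else noisy).append(i)
    return clean, noisy


# Toy data: true class 0 lies near the x-axis, true class 1 near the y-axis.
features = [[1.0, 0.02 * k] for k in range(8)] + [[0.02 * k, 1.0] for k in range(8)]
noisy_labels = [0] * 8 + [1] * 8
for i in (0, 1, 8, 9):  # flip four labels to simulate label noise
    noisy_labels[i] = 1 - noisy_labels[i]

clean, noisy = split_clean_noisy(features, noisy_labels)
print("clean:", sorted(clean))
print("noisy:", sorted(noisy))
```

In this toy setting the flipped samples resemble few members of their assigned class, so their scores fall into the low-similarity mixture component. In a real pipeline the features would come from the network being trained, and the clean set would be treated as labeled data and the noisy set as unlabeled data for the semi-supervised stage.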