Knockoffs-SPR: Clean Sample Selection in Learning with Noisy Labels

A noisy training set usually degrades the generalization and robustness of neural networks. In this paper, we propose a novel, theoretically guaranteed clean sample selection framework for learning with noisy labels. Specifically, we first present a Scalable Penalized Regression (SPR) method to model the linear relation between network features and one-hot labels. In SPR, the clean data are identified by the zero mean-shift parameters solved in the regression model. We theoretically show that SPR can recover clean data under some conditions. In general scenarios, these conditions may no longer hold, and some noisy data are falsely selected as clean data. To solve this problem, we propose a data-adaptive method for Scalable Penalized Regression with Knockoff filters (Knockoffs-SPR), which provably controls the False-Selection-Rate (FSR) in the selected clean data. To improve efficiency, we further present a split algorithm that divides the whole training set into small pieces that can be solved in parallel, making the framework scalable to large datasets. While Knockoffs-SPR can be regarded as a sample selection module for a standard supervised training pipeline, we further combine it with a semi-supervised algorithm to make use of the noisy data as unlabeled data. Experimental results on several benchmark datasets and real-world noisy datasets show the effectiveness of our framework and validate the theoretical results of Knockoffs-SPR. Our code and pre-trained models are available at https://github.com/Yikai-Wang/Knockoffs-SPR.
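To make the mean-shift idea behind SPR concrete, the following is a minimal sketch and not the released implementation. It assumes a model of the form Y = Xβ + γ + ε with a row-wise (group) penalty λ·Σ‖γ_i‖₂ on the mean-shift parameters, solved by simple alternating minimization; the function name `spr_select`, the penalty form, the solver, and the value of `lam` are all illustrative assumptions. The knockoff filter and its FSR control are not included here.

```python
import numpy as np

def spr_select(X, Y, lam, n_iter=200):
    """Hypothetical sketch of mean-shift penalized regression.

    X: (n, d) network features; Y: (n, c) one-hot (possibly noisy) labels.
    Returns a boolean mask that is True where the fitted mean-shift row
    gamma_i is exactly zero, i.e. where the sample is treated as clean.
    """
    gamma = np.zeros(Y.shape)
    X_pinv = np.linalg.pinv(X)  # least-squares solver for the beta step
    for _ in range(n_iter):
        # beta step: ordinary least squares on the shift-corrected labels
        beta = X_pinv @ (Y - gamma)
        # gamma step: row-wise group soft-thresholding of the residuals
        R = Y - X @ beta
        norms = np.linalg.norm(R, axis=1, keepdims=True)
        scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
        gamma = scale * R
    return np.linalg.norm(gamma, axis=1) == 0.0

# Toy data (assumed for illustration): features are the one-hot encoding of
# the true class, and one label per class is flipped to the next class.
true_cls = np.repeat(np.arange(3), 10)      # 30 samples, 10 per class
X = np.eye(3)[true_cls]
flipped = np.array([0, 10, 20])
noisy_cls = true_cls.copy()
noisy_cls[flipped] = (true_cls[flipped] + 1) % 3
Y = np.eye(3)[noisy_cls]

clean_mask = spr_select(X, Y, lam=0.5)
print(clean_mask.sum(), np.flatnonzero(~clean_mask))
```

In this toy setting the three flipped samples are the only ones that receive a non-zero mean shift. On real features, the selection depends on `lam`, and, as the abstract notes, some noisy samples can be falsely selected as clean, which is the problem that Knockoffs-SPR is designed to control.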
