Label noise, meaning incorrect labels assigned to observations, can substantially degrade the performance of supervised classifiers. This paper proposes a label noise cleaning method based on Bernoulli random sampling. For each observation, we consider the mean label noise level of the Bernoulli-sampled subsets that contain it. We show that these mean noise levels are identically distributed across all clean observations and identically distributed, with a different distribution, across all noisy observations. Although they are not independent across observations, we introduce an independent coupling and prove that they converge to a mixture of two well-separated distributions, one corresponding to clean observations and one to noisy observations. By establishing a linear model between cross-validated classification errors and label noise levels, we approximate this mixture distribution and thereby separate clean from noisy observations without any prior label information. The proposed method is classifier-agnostic, theoretically justified, and performs well on both simulated and real datasets.
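The separation claim in the abstract can be illustrated with a small simulation. The sketch below is not the paper's procedure: it assumes the true noise indicators are known (which is only possible in a simulation), whereas the paper approximates subset noise levels through a linear model on cross-validated classification errors. All names and parameter values (`n`, `n_noisy`, `p`, `n_subsets`, `scores`) are illustrative choices, and the largest-gap threshold is a simple stand-in for fitting the two-component mixture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative settings (assumptions, not values from the paper).
n, n_noisy, p, n_subsets = 200, 40, 0.5, 20_000

# Ground-truth noise indicators; available only because this is a simulation.
is_noisy = np.zeros(n, dtype=bool)
is_noisy[rng.choice(n, size=n_noisy, replace=False)] = True

# Bernoulli random sampling: each observation joins each subset
# independently with probability p.
member = (rng.random((n_subsets, n)) < p).astype(np.float64)

# Label noise level of a subset = fraction of its members that are mislabeled.
sizes = np.maximum(member.sum(axis=1), 1.0)  # guard against an empty subset
subset_noise = (member @ is_noisy.astype(np.float64)) / sizes

# For each observation, average the noise level over the subsets containing it.
scores = (member.T @ subset_noise) / member.sum(axis=0)

# Split the scores at the largest gap between consecutive sorted values,
# a crude substitute for estimating the two mixture components.
order = np.sort(scores)
cut = np.argmax(np.diff(order))
threshold = 0.5 * (order[cut] + order[cut + 1])
pred_noisy = scores > threshold

print("mean score, clean:", scores[~is_noisy].mean())
print("mean score, noisy:", scores[is_noisy].mean())
print("observations flagged as noisy:", int(pred_noisy.sum()))
```

A noisy observation raises the noise level of every subset it belongs to by roughly one over the subset size, so its averaged score sits slightly above that of a clean observation. Averaging over many subsets shrinks the spread within each group until the two groups no longer overlap.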