Label noise is a common problem in real-world datasets, affecting both model training and validation. Clean data are essential for achieving strong performance and ensuring reliable evaluation. While various techniques have been proposed to detect noisy labels, there is no clear consensus on the optimal approach. We present a comprehensive benchmark of detection methods by decomposing them into three fundamental components: the label agreement function, the aggregation method, and the information-gathering approach (in-sample vs. out-of-sample). This decomposition applies to many existing detection methods and enables systematic comparison across diverse approaches. To compare methods fairly, we propose a unified benchmark task: detecting a fraction of training samples equal to the dataset's noise rate. We also introduce a novel metric: the false negative rate at this fixed operating point. Our evaluation spans vision and tabular datasets under both synthetic and real-world noise conditions. We find that in-sample information gathering with average-probability aggregation, combined with the logit margin as the label agreement function, achieves the best results across most scenarios. Our findings provide practical guidance for designing new detection methods and for selecting techniques for specific applications.
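The best-performing combination described above can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the toy logits and labels are random placeholders, the margin is computed on epoch-averaged probabilities as a stand-in for the paper's logit-margin agreement function, and the noise rate is assumed known (as in the proposed benchmark task, which fixes the flagged fraction to the dataset's noise rate).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative assumptions, not the paper's data):
# per-epoch logits for N samples and C classes, gathered in-sample.
n_samples, n_classes, n_epochs = 100, 5, 4
logits = rng.normal(size=(n_epochs, n_samples, n_classes))
labels = rng.integers(0, n_classes, size=n_samples)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Aggregation: average predicted probabilities across training epochs.
avg_probs = softmax(logits).mean(axis=0)  # shape (N, C)

# Label agreement via a margin: score of the given label minus the best
# competing class. Higher margin = stronger agreement with the label.
idx = np.arange(n_samples)
label_score = avg_probs[idx, labels]
competing = avg_probs.copy()
competing[idx, labels] = -np.inf
margin = label_score - competing.max(axis=1)

# Fixed operating point: flag the lowest-margin fraction of samples
# equal to the (assumed known) noise rate.
noise_rate = 0.2
k = int(round(noise_rate * n_samples))
flagged = np.argsort(margin)[:k]  # indices of suspected noisy labels

# Metric: false negative rate at this operating point, given ground-truth
# noise indicators (synthetic here purely for illustration).
true_noisy = np.zeros(n_samples, dtype=bool)
true_noisy[rng.choice(n_samples, k, replace=False)] = True
is_flagged = np.isin(idx, flagged)
fnr = np.sum(true_noisy & ~is_flagged) / true_noisy.sum()
```

Because the number of flagged samples is fixed by the noise rate, the false negative and false positive counts coincide at this operating point, which is what makes the single FNR number a fair basis for comparing detectors.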