Label noise, commonly found in real-world datasets, has a detrimental impact on a model's generalization. To detect incorrectly labeled instances, previous works have mostly relied on distinguishable training signals, such as training loss, as indicators to differentiate between clean and noisy labels. However, these signals only partially reveal the model's behavior and do not generalize well across noise types, which limits detection accuracy. In this paper, we propose DynaCor, a framework that distinguishes incorrectly labeled instances from correctly labeled ones based on the dynamics of the training signals. To cope with the absence of supervision for clean and noisy labels, DynaCor first introduces a label corruption strategy that augments the original dataset with intentionally corrupted labels, enabling indirect simulation of the model's behavior on noisy labels. Then, DynaCor learns to identify clean and noisy instances by inducing two clearly distinguishable clusters from the latent representations of training dynamics. Our comprehensive experiments show that DynaCor outperforms state-of-the-art competitors and is robust to various noise types and noise rates.
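To make the label corruption step concrete, the following is a minimal sketch of one way such an augmentation could be built. It is not taken from the paper: the function name `corrupt_labels`, the `corruption_rate` parameter, and the choice of uniform flipping to a different class are all illustrative assumptions, since the abstract does not specify how labels are corrupted.

```python
import random


def corrupt_labels(labels, num_classes, corruption_rate=0.1, seed=0):
    """Append intentionally mislabeled copies of some instances (hypothetical helper).

    Returns three parallel lists over the augmented dataset:
      augmented     - the original labels followed by the flipped labels
      is_corrupted  - True only for the intentionally flipped copies
      source_index  - index of the original instance each entry comes from
    """
    if num_classes < 2:
        raise ValueError("need at least two classes to flip a label")

    rng = random.Random(seed)
    n = len(labels)
    n_corrupt = round(corruption_rate * n)

    # Choose which original instances get a corrupted copy (assumption: uniform sampling).
    picked = rng.sample(range(n), n_corrupt)

    augmented = list(labels)
    is_corrupted = [False] * n
    source_index = list(range(n))

    for i in picked:
        # An offset in [1, num_classes - 1] taken modulo num_classes
        # guarantees the new label differs from the given one.
        offset = rng.randrange(1, num_classes)
        augmented.append((labels[i] + offset) % num_classes)
        is_corrupted.append(True)
        source_index.append(i)

    return augmented, is_corrupted, source_index


if __name__ == "__main__":
    labels = [0, 1, 2, 3] * 25
    augmented, is_corrupted, source_index = corrupt_labels(labels, num_classes=4, corruption_rate=0.2)
    print(len(augmented), sum(is_corrupted))
```

Note that `is_corrupted` marks only the copies flipped on purpose. The original entries may still carry real label noise, which is why the flag serves as an indirect reference for noisy-label behavior rather than as ground truth for the whole dataset.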