Most real-world classification tasks suffer from label noise to some extent. Such noise adversely affects the generalization error of learned models and complicates the evaluation of noise-handling methods, as their performance cannot be accurately measured without clean labels. In label noise research, the baseline is typically either data that are already noisy or overly simple simulated data, into which additional noise with known properties is injected. In this paper, we propose SYNLABEL, a framework that aims to improve upon these methodologies. It allows for creating a noiseless dataset informed by real data: a function is either pre-specified or learned, and is then defined as the ground truth function from which labels are generated. Furthermore, by resampling a number of values for selected features in the function domain, evaluating the function and aggregating the resulting labels, each data point can be assigned a soft label or label distribution. Such distributions allow for direct injection and quantification of label noise. The generated datasets serve as a clean baseline of adjustable complexity into which different types of noise may be introduced. We illustrate how the framework can be applied, how it enables quantification of label noise and how it improves over existing methodologies.
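The pipeline described in the abstract (ground truth function, feature resampling, label aggregation, noise injection) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the SYNLABEL implementation: the function `ground_truth`, the choice to resample the hidden feature from its empirical marginal, the uniform noise model, and the use of total variation distance as a noise measure are all hypothetical choices made here for the example.

```python
import random
from collections import Counter

N_CLASSES = 2


def ground_truth(x1, x2):
    """Hypothetical pre-specified ground truth function: class 1 iff x1 + x2 > 1."""
    return 1 if x1 + x2 > 1.0 else 0


def soft_label(x1, x2_pool, n_samples, rng):
    """Resample the hidden feature x2, evaluate the function, and
    aggregate the resulting hard labels into a label distribution."""
    counts = Counter(
        ground_truth(x1, rng.choice(x2_pool)) for _ in range(n_samples)
    )
    return [counts[c] / n_samples for c in range(N_CLASSES)]


def inject_uniform_noise(dist, rate):
    """Assumed noise model: mix a label distribution with the uniform one."""
    k = len(dist)
    return [(1.0 - rate) * p + rate / k for p in dist]


def total_variation(p, q):
    """Assumed noise measure: total variation distance between distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


rng = random.Random(0)

# Stand-in for real feature data: two features drawn uniformly from [0, 1).
data = [(rng.random(), rng.random()) for _ in range(200)]

# Noiseless hard labels come directly from the ground truth function.
hard_labels = [ground_truth(x1, x2) for x1, x2 in data]

# Treat x2 as hidden and resample it from its empirical marginal (an assumption).
x2_pool = [x2 for _, x2 in data]
soft_labels = [soft_label(x1, x2_pool, 100, rng) for x1, _ in data]

# Inject noise into the distributions and quantify it per data point.
noisy_labels = [inject_uniform_noise(d, 0.2) for d in soft_labels]
noise_levels = [total_variation(p, q) for p, q in zip(soft_labels, noisy_labels)]
mean_noise = sum(noise_levels) / len(noise_levels)

print(f"mean injected noise (total variation): {mean_noise:.3f}")
```

Because both the clean and the noisy labels are available as distributions, the amount of injected noise can be computed per data point rather than estimated, which is the property the abstract highlights.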