Extracting noisy or incorrectly labeled samples from a labeled dataset that also contains hard (difficult) samples is an important yet under-explored topic. Two general and often independent lines of work exist: one addresses noisy labels, and the other deals with hard samples. However, when both types of data are present, most existing methods treat them identically, which degrades the overall performance of the model. In this paper, we first design various synthetic datasets with custom hardness and noisiness levels for different samples. Our systematic empirical study enables us to better understand the similarities and, more importantly, the differences between hard-to-learn samples and incorrectly labeled samples. These controlled experiments pave the way for the development of methods that distinguish between hard and noisy samples. Through our study, we introduce a simple yet effective metric that filters out samples with noisy labels while keeping the hard samples. We study various data partitioning methods in the presence of label noise and observe that using the proposed metric to separate noisy samples from hard samples yields the best datasets, as evidenced by the high test accuracy of models trained on the filtered datasets. We demonstrate this both on our synthetic datasets and on datasets with real-world label noise. Furthermore, our proposed data partitioning method significantly outperforms other methods when employed within a semi-supervised learning framework.