Label noise presents a fundamental challenge in modern machine learning, especially when large-scale datasets are generated via automated processes. An increasingly common and important data paradigm, particularly in domains like medical imaging, involves learning from a large dataset with coarse, noisy labels supplemented by a small, expert-verified, clean dataset. This setting constitutes a typical information transfer and fusion problem. However, the significant distribution shift between the noisy and clean data violates the parametric similarity assumptions at the core of existing statistical transfer learning methods, and the reliance of those methods on parametric models makes them ill-suited to complex data such as images. To address these limitations, this paper develops a generic, model-agnostic, nonparametric framework for classification with label noise, which applies to a broad class of classifiers. Our approach leverages the small clean dataset to ``purify'' the large noisy one and carefully manages the remaining ambiguous samples. The framework is underpinned by a rigorous statistical theory, and its empirical performance is demonstrated through simulations and a real-world application to medical image analysis for pneumonia diagnosis.
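To make the ``purify'' idea concrete, the following is a minimal sketch and not the paper's actual procedure, which the abstract does not specify. It assumes binary labels, a k-nearest-neighbor estimate of the class probability fitted on the clean set, and a hypothetical confidence threshold `margin`. The function names `knn_prob` and `purify` are illustrative only. A noisy sample is kept when the clean-data estimate is confident and agrees with its noisy label; every other sample is set aside as ambiguous.

```python
import math


def knn_prob(x, clean_X, clean_y, k=3):
    """Estimate P(y = 1 | x) as the share of positive labels among the
    k clean points nearest to x (a simple nonparametric estimate)."""
    neighbors = sorted(
        (math.dist(x, cx), cy) for cx, cy in zip(clean_X, clean_y)
    )
    return sum(cy for _, cy in neighbors[:k]) / k


def purify(noisy_X, noisy_y, clean_X, clean_y, k=3, margin=0.3):
    """Split the noisy set into (kept, ambiguous).

    Assumption for illustration: a sample is kept only if the clean-data
    estimate is at least `margin` away from 0.5 and its implied label
    matches the noisy label. Everything else is returned as ambiguous.
    """
    kept, ambiguous = [], []
    for x, y in zip(noisy_X, noisy_y):
        p = knn_prob(x, clean_X, clean_y, k)
        confident = abs(p - 0.5) >= margin
        predicted = 1 if p > 0.5 else 0
        if confident and predicted == y:
            kept.append((x, y))
        else:
            ambiguous.append((x, y))
    return kept, ambiguous


# Toy data: a small clean set with two well-separated classes.
clean_X = [(0.0,), (0.2,), (0.4,), (2.0,), (2.2,), (2.4,)]
clean_y = [0, 0, 0, 1, 1, 1]

# Larger noisy set (kept tiny here): two labels are flipped, one point
# lies between the classes.
noisy_X = [(0.1,), (0.3,), (2.1,), (2.3,), (1.2,)]
noisy_y = [0, 1, 1, 0, 1]

kept, ambiguous = purify(noisy_X, noisy_y, clean_X, clean_y)
print("kept:", kept)
print("ambiguous:", ambiguous)
```

In this toy setup the two samples whose noisy labels match a confident clean-data estimate are kept, while the two flipped labels and the point between the classes are returned as ambiguous. How the ambiguous samples are then used (discarded, down-weighted, or relabeled) is a separate design choice that this sketch leaves open.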