We investigate learning with noisy labels in real-world annotation scenarios, where noise falls into two types: factual noise and ambiguity noise. To better distinguish these noise types and exploit their semantics, we propose Proto-semi, a novel sample-selection-based approach to noisy label learning. Proto-semi first divides all samples into a confident dataset and an unconfident dataset after a warm-up phase. Prototype vectors are then constructed from the confident dataset to capture the characteristics of each class. Next, the distances between the unconfident samples and the prototype vectors are computed to classify the noise. Based on these distances, each label is either corrected or retained, which refines the confident and unconfident datasets. Finally, we introduce a semi-supervised learning method to enhance training. Experiments on a real-world annotated dataset show that Proto-semi is robust when learning from noisy labels, and that the prototype-based repartitioning strategy effectively mitigates the adverse impact of label noise. Our code and data are available at https://github.com/fuxiAIlab/ProtoSemi.
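The prototype-based repartitioning step described in the abstract can be illustrated with a minimal sketch. This is not the released implementation: the abstract does not specify how prototypes are built, which distance is used, or how the correct-or-retain decision is made. The sketch assumes class-mean prototypes, Euclidean distance, and a single hypothetical `threshold`; the function names `build_prototypes` and `repartition` are likewise illustrative, and every class is assumed to have at least one confident sample.

```python
import numpy as np


def build_prototypes(features, labels, num_classes):
    """Return one prototype per class.

    Assumption: a prototype is the mean feature vector of the confident
    samples carrying that class label.
    """
    return np.stack(
        [features[labels == c].mean(axis=0) for c in range(num_classes)]
    )


def repartition(unconf_features, unconf_labels, prototypes, threshold):
    """Correct or retain the labels of unconfident samples.

    Assumed rule: if the nearest prototype lies within `threshold`,
    the sample takes that prototype's class and is flagged as confident;
    otherwise its given label is retained and it stays unconfident.
    """
    # Pairwise Euclidean distances, shape (n_unconfident, num_classes).
    diffs = unconf_features[:, None, :] - prototypes[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)

    nearest = dists.argmin(axis=1)
    is_close = dists.min(axis=1) <= threshold

    new_labels = np.where(is_close, nearest, unconf_labels)
    return new_labels, is_close


# Toy 2-D features standing in for embeddings from a warmed-up network.
conf_x = np.array([[0.0, 0.0], [0.2, 0.0], [4.0, 4.0], [4.0, 4.2]])
conf_y = np.array([0, 0, 1, 1])
unconf_x = np.array([[0.1, 0.1], [3.9, 4.0], [2.0, 2.0]])
unconf_y = np.array([1, 1, 0])

prototypes = build_prototypes(conf_x, conf_y, num_classes=2)
new_labels, moved = repartition(unconf_x, unconf_y, prototypes, threshold=1.0)
```

In this toy run, the first unconfident sample sits next to the class-0 prototype, so its label is corrected from 1 to 0. The second already agrees with its nearest prototype. The third is far from both prototypes, so it keeps its given label and remains in the unconfident set, where a semi-supervised method could treat it as unlabeled.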