Learning with noisy labels (LNL) is challenging because models tend to memorize noisy labels, which leads to overfitting. Many LNL methods detect clean samples by maximizing the similarity between samples within each category, an approach that makes no assumptions about likely noise sources. However, we often have some knowledge about the potential source(s) of noisy labels. For example, an image mislabeled as a cheetah is more likely to be a leopard than a hippopotamus, because cheetahs and leopards are visually similar. Thus, we introduce a new task called Learning with Noisy Labels and noise source distribution Knowledge (LNL+K), which assumes we have some knowledge about the likely source(s) of label noise that we can take advantage of. Under this assumption, methods are better equipped to distinguish hard negatives between categories from label noise. In addition, it enables us to explore datasets where noisy samples may form the majority, a setting that breaks a critical premise of most methods developed for the LNL task. We explore several baseline LNL+K approaches that integrate noise source knowledge into state-of-the-art LNL methods, evaluated across three diverse datasets and three types of noise, and report a 5-15% boost in performance compared with the unadapted methods. Critically, we find that LNL methods do not generalize well in every setting, highlighting the importance of directly exploring our LNL+K task.
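To make the idea of using noise source knowledge concrete, the following is a minimal sketch rather than any of the baseline methods evaluated in this work. It assumes a hypothetical classifier that outputs per-class probabilities and a hypothetical mapping `noise_sources` from each label to the classes known to be mislabeled as it. A sample is kept as clean when its given label scores at least as high as every known noise-source class. The function name `detect_clean` and the toy data are illustrative assumptions.

```python
from typing import Dict, List, Sequence


def detect_clean(
    probs: Sequence[Sequence[float]],
    labels: Sequence[int],
    noise_sources: Dict[int, List[int]],
) -> List[bool]:
    """Flag each sample as clean (True) or likely noisy (False).

    probs[i][c] is a hypothetical model's predicted probability that sample i
    belongs to class c; noise_sources[c] lists the classes whose samples are
    assumed to be mislabeled as c.
    """
    clean = []
    for p, y in zip(probs, labels):
        sources = noise_sources.get(y, [])
        # Only the known noise-source classes compete with the given label.
        # Other classes are ignored, so a sample that merely resembles a
        # non-source class is kept rather than discarded as noise.
        rival = max((p[s] for s in sources), default=0.0)
        clean.append(p[y] >= rival)
    return clean


# Toy classes: 0 = cheetah, 1 = leopard, 2 = hippopotamus.
# Assumed knowledge: leopards are sometimes mislabeled as cheetahs.
NOISE_SOURCES = {0: [1]}
PROBS = [
    [0.6, 0.3, 0.1],  # labeled cheetah, scored as a cheetah
    [0.2, 0.7, 0.1],  # labeled cheetah, scored as a leopard
    [0.3, 0.1, 0.6],  # labeled cheetah, scored closest to a non-source class
    [0.4, 0.5, 0.1],  # labeled leopard, no known noise source
]
LABELS = [0, 0, 0, 1]
FLAGS = detect_clean(PROBS, LABELS, NOISE_SOURCES)
print(FLAGS)
```

In this toy setup only the second sample is flagged as noisy, since it is the only one whose noise-source class outscores its given label. Because the rule compares against noise-source classes instead of assuming that most samples in a category are clean, it does not rely on clean samples forming the majority.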