Label noise is one of the key factors that lead to poor generalization in deep learning models. Existing label-noise learning methods usually assume that the ground-truth classes of the training data are balanced. However, real-world data is often imbalanced, so under label noise the observed class distribution differs from the intrinsic one. Because the intrinsic class distribution is unknown, it is hard to distinguish clean samples from noisy samples in the intrinsic tail classes. In this paper, we propose a learning framework for label-noise learning with intrinsically long-tailed data. Specifically, we propose a two-stage bi-dimensional sample selection method (TABASCO) to better separate clean samples from noisy samples, especially for the tail classes. TABASCO uses two new separation metrics that complement each other, compensating for the limitations of relying on a single metric for sample separation. Extensive experiments on benchmarks demonstrate the effectiveness of our method. Our code is available at https://github.com/Wakings/TABASCO.
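The gap between observed and intrinsic class distributions can be illustrated with a small calculation. The sketch below is not the TABASCO method. It assumes an exponentially decaying long-tailed class profile and symmetric label noise, neither of which is specified in this abstract, and all function names and parameter values are hypothetical choices made for illustration.

```python
def long_tailed_counts(num_classes, max_count, imbalance_ratio):
    """Intrinsic (clean) class sizes that decay exponentially.

    Assumption: the head class has max_count samples and the tail class
    has max_count / imbalance_ratio samples.
    """
    return [
        round(max_count * imbalance_ratio ** (-k / (num_classes - 1)))
        for k in range(num_classes)
    ]


def expected_observed_counts(intrinsic, noise_rate):
    """Expected per-class label counts after symmetric label noise.

    Assumption: each label is flipped with probability noise_rate to one
    of the other classes, chosen uniformly.
    """
    num_classes = len(intrinsic)
    total = sum(intrinsic)
    observed = []
    for n_k in intrinsic:
        kept = (1 - noise_rate) * n_k
        received = noise_rate * (total - n_k) / (num_classes - 1)
        observed.append(kept + received)
    return observed


def clean_fraction(intrinsic, noise_rate):
    """Expected share of samples labeled k whose true class is k."""
    observed = expected_observed_counts(intrinsic, noise_rate)
    return [(1 - noise_rate) * n / o for n, o in zip(intrinsic, observed)]


if __name__ == "__main__":
    intrinsic = long_tailed_counts(num_classes=10, max_count=5000, imbalance_ratio=100)
    observed = expected_observed_counts(intrinsic, noise_rate=0.4)
    fractions = clean_fraction(intrinsic, noise_rate=0.4)
    for k, (n, o, f) in enumerate(zip(intrinsic, observed, fractions)):
        print(f"class {k}: intrinsic={n:5d} observed={o:8.1f} clean share={f:.2f}")
```

Under these assumptions the observed label counts are noticeably flatter than the intrinsic ones, and most samples carrying an intrinsic tail-class label are mislabeled samples from larger classes, while the head class stays mostly clean. This is why the observed distribution alone does not reveal which classes are intrinsically tail classes.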