Although self-supervised learning (SSL) has been widely studied as a promising technique for representation learning, it does not generalize well on long-tailed datasets because the majority classes dominate the feature space. Recent work shows that long-tailed learning performance can be boosted by sampling extra in-domain (ID) data for self-supervised training; however, large-scale ID data that can rebalance the minority classes are expensive to collect. In this paper, we propose an alternative that is easy to use and effective, Contrastive with Out-of-distribution (OOD) data for Long-Tail learning (COLT), which exploits OOD data to dynamically rebalance the feature space. We empirically identify the counter-intuitive usefulness of OOD samples in SSL long-tailed learning and design a novel SSL method in a principled way. Concretely, we first localize the "head" and "tail" samples by assigning a tailness score to each OOD sample based on its neighborhood in the feature space. Then, we propose an online OOD sampling strategy to dynamically rebalance the feature space. Finally, we train the model to distinguish ID from OOD samples with a distribution-level supervised contrastive loss. Extensive experiments on various datasets and several state-of-the-art SSL frameworks verify the effectiveness of the proposed method. The results show that our method improves the performance of SSL on long-tailed datasets by a large margin, and even outperforms previous work that uses external ID data. Our code is available at https://github.com/JianhongBai/COLT.
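To make the first two steps concrete, the following is a minimal sketch and not the released COLT implementation. It assumes a simple neighbor-counting score: an OOD sample whose feature-space neighborhood contains few ID features is treated as lying near the tail. The function names, the cosine-similarity threshold `radius_sim`, and the proportional sampling rule are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def tailness_scores(id_feats, ood_feats, radius_sim=0.8):
    """Hypothetical tailness score (an assumption, not the paper's definition).

    Counts the ID features whose cosine similarity to each OOD feature is at
    least `radius_sim`; fewer ID neighbors gives a higher score.
    """
    sim = l2_normalize(ood_feats) @ l2_normalize(id_feats).T  # (n_ood, n_id)
    neighbor_counts = (sim >= radius_sim).sum(axis=1)
    return 1.0 / (1.0 + neighbor_counts)


def sample_ood(scores, budget, rng):
    """Draw `budget` distinct OOD indices with probability proportional to score."""
    probs = scores / scores.sum()
    return rng.choice(len(scores), size=budget, replace=False, p=probs)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy long-tailed ID set: 100 "head" features and 5 "tail" features.
    id_feats = np.vstack([np.tile([1.0, 0.0], (100, 1)),
                          np.tile([0.0, 1.0], (5, 1))])
    # One OOD feature near the head direction, one near the tail direction.
    ood_feats = np.array([[1.0, 0.01], [0.01, 1.0]])
    scores = tailness_scores(id_feats, ood_feats)
    print(scores, sample_ood(scores, budget=1, rng=rng))
```

In an online setting, the scores would be recomputed from the current encoder's features at intervals during training, so that the sampled OOD subset tracks the evolving feature space; the schedule for doing so is not specified here.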