On the Effectiveness of Out-of-Distribution Data in Self-Supervised Long-Tail Learning

Though self-supervised learning (SSL) has been widely studied as a promising technique for representation learning, it does not generalize well on long-tailed datasets because the majority classes dominate the feature space. Recent work shows that long-tailed learning performance can be boosted by sampling extra in-domain (ID) data for self-supervised training; however, large-scale ID data that can rebalance the minority classes are expensive to collect. In this paper, we propose an alternative that is easy to use and effective, Contrastive with Out-of-distribution (OOD) data for Long-Tail learning (COLT), which exploits OOD data to dynamically re-balance the feature space. We empirically identify the counter-intuitive usefulness of OOD samples in SSL long-tailed learning and design a novel SSL method in a principled way. Concretely, we first localize the 'head' and 'tail' samples by assigning a tailness score to each OOD sample based on its neighborhood in the feature space. Then, we propose an online OOD sampling strategy to dynamically re-balance the feature space. Finally, we train the model to distinguish ID from OOD samples with a distribution-level supervised contrastive loss. Extensive experiments on various datasets and several state-of-the-art SSL frameworks verify the effectiveness of the proposed method. The results show that our method improves the performance of SSL on long-tailed datasets by a large margin, and even outperforms previous work that uses external ID data. Our code is available at https://github.com/JianhongBai/COLT.
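
The three steps above (tailness scoring, online OOD sampling, and a distribution-level contrastive loss) can be pictured with the minimal NumPy sketch below. It is not the COLT implementation, which is in the repository linked above. The rarity-based tailness score, the softmax sampling rule, the temperatures, and all function names are assumptions made for illustration, and the sketch further assumes that each ID sample already carries a pseudo-cluster index (for example from k-means on its features).

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def tailness_scores(id_feats, id_cluster, ood_feats, k=5):
    """Hypothetical tailness score (an assumption, not the paper's definition).

    For each OOD sample, look at its k nearest ID neighbors in feature space
    and average the rarity of their clusters. Neighborhoods made of small
    (tail-like) clusters give scores near 1, large clusters near 0.
    """
    sim = l2_normalize(ood_feats) @ l2_normalize(id_feats).T  # (n_ood, n_id)
    nn_idx = np.argsort(-sim, axis=1)[:, :k]                  # k most similar ID samples
    cluster_size = np.bincount(id_cluster)
    rarity = 1.0 - cluster_size / cluster_size.sum()          # one value per cluster
    return rarity[id_cluster[nn_idx]].mean(axis=1)

def sample_ood(scores, n, rng, temperature=0.1):
    """Draw n distinct OOD indices, favoring high tailness (assumed softmax rule)."""
    logits = scores / temperature
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return rng.choice(len(scores), size=n, replace=False, p=p)

def distribution_supcon_loss(feats, is_ood, temperature=0.5):
    """Supervised contrastive loss whose only label is the source distribution.

    Samples from the same distribution (ID with ID, OOD with OOD) are treated
    as positives; samples from the other distribution act as negatives.
    """
    z = l2_normalize(feats)
    n = len(z)
    self_mask = np.eye(n, dtype=bool)
    # Exclude each anchor's similarity with itself from the softmax.
    logits = np.where(self_mask, -np.inf, z @ z.T / temperature)
    m = logits.max(axis=1, keepdims=True)
    log_prob = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    pos = (is_ood[:, None] == is_ood[None, :]) & ~self_mask
    pos_count = pos.sum(axis=1)
    valid = pos_count > 0  # anchors with at least one positive
    per_anchor = -np.where(pos, log_prob, 0.0).sum(axis=1)[valid] / pos_count[valid]
    return per_anchor.mean()
```

In an online setting, one would presumably recompute the scores every few epochs from the current encoder's features, resample the OOD subset, and add the loss above to the base SSL objective. The refresh schedule and loss weighting are not specified in the abstract and are left open here.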
