State-of-the-art image models predominantly follow a two-stage strategy: pre-training on large datasets, then fine-tuning with cross-entropy loss. Many studies have shown that cross-entropy can result in sub-optimal generalisation and stability. The supervised contrastive loss addresses some of these limitations by focusing on intra-class similarities and inter-class differences, but it neglects the importance of hard negative mining. We propose that model performance improves when negative samples are weighted based on their dissimilarity to positive counterparts. In this paper, we introduce a new supervised contrastive learning objective, SCHaNe, which incorporates hard negative sampling during the fine-tuning phase. Without requiring specialised architectures, additional data, or extra computational resources, SCHaNe outperforms the strong baseline BEiT-3 in Top-1 accuracy across various benchmarks, with significant gains of up to $3.32\%$ in few-shot learning settings and $3.41\%$ in full-dataset fine-tuning. Importantly, our proposed objective sets a new state-of-the-art for base models on ImageNet-1k, achieving $86.14\%$ accuracy. Furthermore, we demonstrate that the proposed objective yields better embeddings, which explains the improved effectiveness observed in our experiments.
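The abstract describes the objective only in words, so the following is a minimal sketch of the general idea of a supervised contrastive loss with re-weighted negatives, not the SCHaNe formula itself. The function name `weighted_supcon_loss`, the temperature `tau`, and the hardness parameter `beta` are all hypothetical. The weighting scheme is also an assumption: here a negative's weight grows with its similarity to the anchor, which is one common way to emphasise hard negatives and may differ from the weighting the paper actually uses.

```python
import numpy as np


def weighted_supcon_loss(z, labels, tau=0.1, beta=1.0):
    """Illustrative supervised contrastive loss with hardness-weighted negatives.

    z      : (n, d) array of embeddings
    labels : (n,) array of integer class labels
    tau    : temperature (assumed hyperparameter)
    beta   : hardness strength; beta = 0 gives uniform negative weights
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit-normalise embeddings
    sim = z @ z.T / tau                               # temperature-scaled cosine similarity

    same = labels[:, None] == labels[None, :]
    pos = same & ~np.eye(len(labels), dtype=bool)     # same class, excluding self
    neg = ~same                                       # different class

    # Assumed weighting: negatives closer to the anchor count more.
    # Weights are rescaled to average 1 over each anchor's negatives.
    w = np.where(neg, np.exp(beta * sim), 0.0)
    n_neg = neg.sum(axis=1, keepdims=True)
    w = w * n_neg / np.maximum(w.sum(axis=1, keepdims=True), 1e-12)

    # Weighted negative mass in the denominator of each anchor's softmax.
    neg_term = (w * np.exp(sim)).sum(axis=1, keepdims=True)

    # Log-probability of each candidate against itself plus the weighted negatives.
    log_prob = sim - np.log(np.exp(sim) + neg_term)

    # Average over each anchor's positives, skipping anchors that have none.
    n_pos = pos.sum(axis=1)
    valid = n_pos > 0
    per_anchor = -(log_prob * pos).sum(axis=1)[valid] / n_pos[valid]
    return per_anchor.mean()


# Toy usage: 8 random embeddings, 4 classes with 2 samples each.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(8, 16))
classes = np.array([0, 0, 1, 1, 2, 2, 3, 3])
print(weighted_supcon_loss(embeddings, classes, beta=1.0))
```

In this sketch, setting `beta = 0` makes every negative weight equal to 1, so the loss reduces to an unweighted contrastive form. Raising `beta` shifts weight toward the negatives most similar to the anchor, which can only increase the negative term and therefore the loss for a fixed set of embeddings.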