State-of-the-art pre-trained image models predominantly adopt a two-stage approach: unsupervised pre-training on large-scale datasets, followed by task-specific fine-tuning with Cross-Entropy loss (CE). However, CE has been shown to compromise model generalization and stability. Recent works that employ contrastive learning address some of these limitations by improving embedding quality and producing better decision boundaries, but they often overlook the importance of hard negative mining and rely on slow, resource-intensive training with large batches. To counter these issues, we introduce a novel approach named CLCE, which integrates Label-Aware Contrastive Learning with CE. Our approach not only retains the strengths of both loss functions but also leverages hard negative mining in a synergistic way to enhance performance. Experimental results demonstrate that CLCE significantly outperforms CE in Top-1 accuracy across twelve benchmarks, achieving gains of up to 3.52% in few-shot learning scenarios and 3.41% in transfer learning settings with the BEiT-3 model. Importantly, CLCE effectively mitigates the dependency of contrastive learning on large batch sizes, such as 4096 samples per batch, a limitation that has previously constrained the application of contrastive learning in budget-limited hardware environments.
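To make the idea of combining a label-aware contrastive term with CE concrete, the following is a minimal sketch in plain Python. It is illustrative only and is not taken from the paper: the convex combination with weight `lam`, the temperature `tau`, the similarity-proportional reweighting used to emphasize hard negatives, and all function names are assumptions chosen for this example.

```python
import math


def cross_entropy(logits, label):
    """Numerically stable CE for one sample: log-sum-exp minus the target logit."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]


def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)


def label_aware_contrastive(embeddings, labels, tau=0.5):
    """Supervised contrastive loss with hard-negative reweighting.

    ASSUMPTION: each negative is weighted by its own similarity score relative
    to the mean over negatives, so negatives close to the anchor contribute
    more. This is one simple choice, not necessarily the paper's formulation.
    """
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        neg = [j for j in range(n) if labels[j] != labels[i]]
        if not pos or not neg:
            continue  # anchor needs at least one positive and one negative
        neg_exp = [math.exp(cosine(embeddings[i], embeddings[k]) / tau) for k in neg]
        mean_neg = sum(neg_exp) / len(neg_exp)
        # Harder (more similar) negatives receive weights above 1.
        weighted_neg = sum(x * (x / mean_neg) for x in neg_exp)
        loss_i = 0.0
        for p in pos:
            pos_exp = math.exp(cosine(embeddings[i], embeddings[p]) / tau)
            loss_i += -math.log(pos_exp / (pos_exp + weighted_neg))
        total += loss_i / len(pos)
        anchors += 1
    return total / anchors if anchors else 0.0


def clce_loss(logits, embeddings, labels, lam=0.9, tau=0.5):
    """ASSUMED combination: (1 - lam) * CE + lam * contrastive term."""
    ce = sum(cross_entropy(l, y) for l, y in zip(logits, labels)) / len(labels)
    cl = label_aware_contrastive(embeddings, labels, tau)
    return (1.0 - lam) * ce + lam * cl


if __name__ == "__main__":
    # Toy batch of four samples with 2-d embeddings and 2-class logits.
    emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    logits = [[2.0, 0.0], [1.5, 0.2], [0.1, 1.8], [0.0, 2.2]]
    labels = [0, 0, 1, 1]
    print(clce_loss(logits, emb, labels))
```

Because the reweighting concentrates the negative term on the samples most similar to the anchor, a small batch can still supply an informative contrastive signal, which is the intuition behind reducing the reliance on very large batches. In practice, the same structure would be written with a tensor library and applied to the encoder's embeddings and classifier logits.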