Boosting Visual-Language Models by Exploiting Hard Samples

Large vision and language models, such as Contrastive Language-Image Pre-training (CLIP), are rapidly becoming the industry norm for matching images and texts. To improve their zero-shot recognition performance, current research either adds web-crawled image-text pairs or designs new training losses. However, the additional costs of training from scratch and collecting data substantially hinder the deployment of these approaches. In this paper, we present HELIP, a low-cost strategy for boosting the performance of well-trained CLIP models by fine-tuning them with hard samples drawn from the original training data. With hard examples mixed into each batch, the well-trained CLIP model is fine-tuned using the conventional contrastive alignment objective and a margin loss that distinguishes between normal and hard negative data. HELIP can be deployed on existing models in a plug-and-play fashion. On a comprehensive zero-shot and retrieval benchmark, without training the model from scratch or using additional data, HELIP consistently boosts existing models to achieve leading performance. In particular, HELIP boosts the ImageNet zero-shot accuracy of SLIP by 3.05 and 4.47 when pretrained on CC3M and CC12M, respectively. In addition, a systematic evaluation of zero-shot and linear probing experiments across fine-grained classification datasets demonstrates a consistent performance improvement and validates the efficacy of HELIP. When pretraining on CC3M, HELIP boosts the zero-shot performance of CLIP and SLIP by 8.4% and 18.6% on average, respectively, and linear probe performance by 9.5% and 3.0% on average, respectively.
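
The fine-tuning objective described above, a contrastive alignment term plus a margin term over hard negatives, can be illustrated with a minimal NumPy sketch. The abstract does not give the exact form of the margin loss, so the hinge below is an assumption: it asks each anchor's hard negatives to score at least `margin` higher than its normal negatives. The function names, the `hard_mask` input, the `temperature`, `margin`, and `weight` values are all hypothetical and are not taken from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(img, txt, temperature=0.07):
    """Symmetric CLIP-style contrastive loss; pair i matches pair i."""
    logits = l2_normalize(img) @ l2_normalize(txt).T / temperature
    diag = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)

def margin_loss(img, txt, hard_mask, margin=0.1):
    """Assumed hinge form: hard negatives should outscore normal negatives.

    hard_mask[i, j] is True when text j is a hard negative for image i.
    """
    sim = l2_normalize(img) @ l2_normalize(txt).T
    hard_mask = np.asarray(hard_mask, dtype=bool)
    eye = np.eye(len(sim), dtype=bool)
    hard = hard_mask & ~eye      # hard negatives, excluding the positive pair
    normal = ~hard_mask & ~eye   # all remaining negatives
    total, count = 0.0, 0
    for i in range(len(sim)):
        if not hard[i].any() or not normal[i].any():
            continue  # nothing to compare for this anchor
        weakest_hard = sim[i, hard[i]].min()
        total += np.maximum(0.0, margin + sim[i, normal[i]] - weakest_hard).sum()
        count += int(normal[i].sum())
    return total / count if count else 0.0

def combined_loss(img, txt, hard_mask, weight=1.0):
    """Contrastive term plus a weighted margin term (weight is an assumption)."""
    return contrastive_loss(img, txt) + weight * margin_loss(img, txt, hard_mask)
```

In this sketch a batch would be built by mixing each sampled pair with its mined hard samples, with `hard_mask` recording which in-batch pairs are hard negatives for which anchors; how those hard samples are mined is not specified in the abstract and is left out here.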
