The objective of this paper is to improve the performance of text-to-image retrieval. To this end, we introduce a new framework that can boost the performance of large-scale pre-trained vision-language models, so that they can be used for text-to-image re-ranking. The approach, Enhanced Language-Image Pre-training (ELIP), uses the text query, via a simple MLP mapping network, to predict a set of visual prompts to condition the ViT image encoding. ELIP can easily be applied to the commonly used CLIP, SigLIP and BLIP-2 networks. To train the architecture with limited computing resources, we develop a 'student friendly' best practice involving global hard sample mining and curation of a large-scale dataset. On the evaluation side, we set up two new out-of-distribution (OOD) benchmarks, Occluded COCO and ImageNet-R, to assess the zero-shot generalisation of the models to different domains. The results demonstrate that ELIP significantly boosts CLIP/SigLIP/SigLIP-2 text-to-image retrieval performance and outperforms BLIP-2 on several benchmarks, as well as providing an easy means to adapt to OOD datasets.
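To make the core mechanism concrete, the sketch below illustrates the idea of mapping a text query embedding through a small MLP into a set of visual prompt tokens that are prepended to the ViT patch tokens, so the image encoding is conditioned on the query. This is a minimal illustration only, not the authors' implementation; the module names, embedding dimensions, and number of prompt tokens are assumptions.

```python
# Minimal sketch of the query-conditioned visual prompting idea (assumed details,
# not the ELIP reference code): a text embedding is mapped by a simple MLP into
# a few prompt tokens in the ViT embedding space, then prepended to the patch
# token sequence before the transformer blocks run.
import torch
import torch.nn as nn


class TextToVisualPrompts(nn.Module):
    def __init__(self, text_dim=512, vit_dim=768, num_prompts=4):
        super().__init__()
        self.num_prompts = num_prompts
        self.vit_dim = vit_dim
        # Simple MLP mapping network: text embedding -> flattened prompt tokens.
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, 4 * vit_dim),
            nn.GELU(),
            nn.Linear(4 * vit_dim, num_prompts * vit_dim),
        )

    def forward(self, text_emb):                       # (B, text_dim)
        prompts = self.mlp(text_emb)                   # (B, num_prompts * vit_dim)
        return prompts.view(-1, self.num_prompts, self.vit_dim)


def conditioned_vit_input(cls_token, patch_tokens, prompts):
    """Prepend query-conditioned prompt tokens to the ViT input sequence.

    cls_token:    (B, 1, vit_dim)
    patch_tokens: (B, N, vit_dim)
    prompts:      (B, P, vit_dim)
    """
    return torch.cat([cls_token, prompts, patch_tokens], dim=1)


# Usage sketch: the conditioned token sequence would then be passed through the
# (frozen or fine-tuned) ViT encoder of CLIP/SigLIP/BLIP-2 to produce an image
# embedding specific to the query, which is scored for re-ranking.
mapper = TextToVisualPrompts()
text_emb = torch.randn(8, 512)
cls_tok, patches = torch.randn(8, 1, 768), torch.randn(8, 196, 768)
tokens = conditioned_vit_input(cls_tok, patches, mapper(text_emb))  # (8, 201, 768)
```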