Multimodal models leverage large-scale pre-training to achieve strong but still imperfect performance on tasks such as image captioning, visual question answering, and cross-modal retrieval. In this paper, we present Nearest Neighbor Normalization (NNN), a simple and efficient method for correcting errors in trained contrastive image-text retrieval models without any additional training. We show improvements in retrieval metrics for both text retrieval and image retrieval across all of the contrastive models we tested (CLIP, BLIP, ALBEF, SigLIP, BEiT) and on both of the datasets we used (MS-COCO and Flickr30k). NNN requires a reference database, but does not require any training on this database, and can even increase the retrieval accuracy of a model after finetuning.
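To make the idea concrete, below is a minimal sketch of a nearest-neighbor score correction in this spirit, assuming a bias-subtraction scheme in which each retrieval candidate's raw similarity score is debiased by its mean similarity to its k nearest queries in the reference database. The function name nnn_scores and the hyperparameters k and alpha are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import numpy as np

def nnn_scores(query_embs, gallery_embs, ref_query_embs, k=16, alpha=0.5):
    """Debias retrieval scores by subtracting, for each gallery candidate,
    the mean similarity to its k nearest reference queries.

    query_embs:     (Q, D) L2-normalized query embeddings
    gallery_embs:   (G, D) L2-normalized candidate embeddings
    ref_query_embs: (R, D) L2-normalized reference query embeddings
    """
    # Raw cosine similarities between queries and candidates: (Q, G)
    sims = query_embs @ gallery_embs.T

    # Similarity of every candidate to every reference query: (G, R)
    ref_sims = gallery_embs @ ref_query_embs.T

    # Per-candidate bias: mean of its top-k reference similarities
    topk = np.sort(ref_sims, axis=1)[:, -k:]  # (G, k)
    bias = topk.mean(axis=1)                  # (G,)

    # Subtract the scaled bias from every query's score for that candidate
    return sims - alpha * bias[None, :]
```

Retrieval then ranks candidates by the corrected scores, e.g. `np.argsort(-nnn_scores(q, g, ref), axis=1)`. Note that the reference queries are only embedded and compared, never trained on, consistent with the training-free claim above.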