Pretrained cross-modal models, most representatively CLIP, have recently driven a surge in applying pretrained models to cross-modal zero-shot tasks owing to their strong generalization. However, we analytically find that CLIP suffers from a text-to-image retrieval hallucination that limits its zero-shot capabilities: when asked to identify which of several candidate images matches a given query text, CLIP simply selects the image with the highest similarity score, even though it correctly recognizes the contents of the images. Accordingly, we propose a Balanced Score with Auxiliary Prompts (BSAP) to mitigate CLIP's text-to-image retrieval hallucination under zero-shot learning. Specifically, we first design auxiliary prompts that provide multiple reference outcomes for each image retrieval; the outcomes derived from each retrieved image together with the target text are then normalized to obtain the final similarity, which alleviates the model's hallucination. In addition, CLIP's original results can be merged with BSAP to obtain a more robust hybrid outcome (BSAP-H). Extensive experiments on two typical zero-shot learning tasks, i.e., Referring Expression Comprehension (REC) and Referring Image Segmentation (RIS), demonstrate the effectiveness of our BSAP. In particular, on the RefCOCO validation set for REC, BSAP improves CLIP's performance by 20.6%. Furthermore, we validate that our strategy also applies to other pretrained cross-modal models, such as ALBEF and BLIP.
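The balancing idea above can be illustrated with a minimal sketch. Assuming a precomputed similarity matrix `sim[i][j]` between each candidate image `i` and each text `j` (the target text plus auxiliary prompts), BSAP-style scoring softmax-normalizes each image's row so that an image scoring high only because it scores high on everything is penalized; the exact prompt design and normalization in the paper may differ, and the `bsap_h_scores` mixing weight `alpha` is a hypothetical parameter for this sketch:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def bsap_scores(sim, target_idx=0):
    """Balanced score sketch: for each image, normalize its similarities
    across the target text and the auxiliary prompts, then keep the
    target text's share. An image that matches the target no better than
    the auxiliary prompts gets a low balanced score."""
    return [softmax(row)[target_idx] for row in sim]

def bsap_h_scores(sim, target_idx=0, alpha=0.5):
    """Hybrid sketch (BSAP-H-like): mix CLIP's original target-text
    scores (normalized across images) with the balanced scores."""
    orig = softmax([row[target_idx] for row in sim])
    bal = bsap_scores(sim, target_idx)
    return [alpha * o + (1 - alpha) * b for o, b in zip(orig, bal)]
```

For instance, an image with uniformly high similarity to every prompt (a typical hallucination case) can outscore the correct image under the raw target-text score, while the balanced score favors the image whose target-text similarity stands out from its auxiliary-prompt similarities.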