CLIP aligns text and images and is among the most frequently used foundation models in cross-modal zero-shot learning. However, our experiments reveal that CLIP suffers from a bias in text-to-image retrieval, which degrades its zero-shot learning performance. Our analysis shows that this bias arises partly from the imbalanced range of similarity scores that CLIP produces. Accordingly, we propose Balanced Similarity with Auxiliary Prompts (BSAP) to mitigate CLIP's text-to-image retrieval bias. Specifically, BSAP designs auxiliary prompts with which CLIP computes multiple similarity scores for each candidate image, and then normalizes the scores between each image and both the given query text and the auxiliary prompts to obtain balanced similarity scores. The balanced similarity score of the query text is used for the final retrieval. In addition, we explore a hybrid similarity that combines BSAP with CLIP's original similarity to obtain a more robust outcome. Extensive experiments on two typical zero-shot learning tasks, i.e., Referring Expression Comprehension (REC) and Referring Image Segmentation (RIS), demonstrate the effectiveness of BSAP. For example, on the validation set of RefCOCO in REC, BSAP improves CLIP's performance by 20.6%.
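The balancing step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which normalization BSAP uses or how the hybrid similarity is formed, so the softmax across texts, the convex combination with weight `alpha`, the `temperature` parameter, and all function names below are assumptions made for illustration, not the authors' implementation. The similarity values are invented toy numbers rather than CLIP outputs.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def balanced_similarity(sim, temperature=1.0):
    """Normalize each image's scores across the query and auxiliary prompts.

    sim[i][0] is the raw similarity of image i to the query text;
    sim[i][1:] are its similarities to the auxiliary prompts.
    Assumption: softmax stands in for the unspecified normalization.
    Returns the query's normalized share for every image.
    """
    return [softmax(row, temperature)[0] for row in sim]


def hybrid_similarity(sim, alpha=0.5, temperature=1.0):
    """Assumed hybrid: convex mix of the balanced and raw query scores."""
    balanced = balanced_similarity(sim, temperature)
    return [alpha * b + (1.0 - alpha) * row[0] for b, row in zip(balanced, sim)]


def retrieve(scores):
    """Index of the image with the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy scores (hypothetical): columns are [query, aux prompt 1, aux prompt 2].
sim = [
    [0.30, 0.10, 0.10],  # image 0: matches the query far better than the aux prompts
    [0.35, 0.34, 0.36],  # image 1: scores high against every text
]

raw = [row[0] for row in sim]
balanced = balanced_similarity(sim)

print(retrieve(raw))       # 1: the uniformly high-scoring image wins on raw similarity
print(retrieve(balanced))  # 0: after balancing, the query-specific image wins
```

In this toy case, image 1 receives a high score against every text, so its raw query score says little about the query itself. Normalizing across texts removes that image-specific offset, which is the kind of imbalance in score ranges that the abstract attributes the retrieval bias to.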