Although CLIP is the foundation model in numerous vision-language applications, it suffers from a severe text spotting bias. This bias causes CLIP models to "parrot" the visual text embedded within images while disregarding the authentic visual semantics. We uncover that in the most popular image-text dataset, LAION-2B, the captions also densely parrot (spell) the text embedded in images. Our analysis shows that around 50% of images contain embedded visual text, and around 30% of caption words appear in that embedded text. Based on this observation, we thoroughly inspect the different released versions of CLIP models and verify that visual text is the dominant factor in measuring LAION-style image-text similarity for these models. To examine whether these parrot captions shape the text spotting bias, we train a series of CLIP models on LAION subsets curated by different parrot-caption-oriented criteria. We show that training with parrot captions easily shapes such bias but harms the expected vision-language representation learning in CLIP models. This suggests that it is urgent to revisit either the design of CLIP-like models or the existing image-text dataset curation pipeline built on CLIP score filtering.
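
To make the notion of a parrot caption concrete, the sketch below computes the fraction of caption words that also occur in the text spotted in an image. It is only an illustration under stated assumptions: the spotted text is assumed to be already available as a list of strings from some text spotter, the tokenizer is a simple lowercase alphanumeric split, and the names `tokenize` and `parrot_ratio` are hypothetical. It is not necessarily the exact measurement behind the statistics reported above.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def parrot_ratio(caption, spotted_text):
    """Fraction of caption words that also appear in the spotted text.

    `spotted_text` is a list of strings assumed to come from a text
    spotter run on the image (hypothetical input, not a real API).
    """
    caption_words = tokenize(caption)
    if not caption_words:
        return 0.0
    # Collect every word found in the image into one vocabulary.
    spotted_vocab = set()
    for fragment in spotted_text:
        spotted_vocab.update(tokenize(fragment))
    hits = sum(1 for word in caption_words if word in spotted_vocab)
    return hits / len(caption_words)


# Made-up examples for illustration only.
samples = [
    ("Keep Calm and Carry On poster", ["KEEP CALM", "AND", "CARRY ON"]),
    ("a dog running on the beach", []),
]
for caption, spotted in samples:
    print(f"{parrot_ratio(caption, spotted):.2f}  {caption}")
```

A caption with a high ratio mostly spells out the text in the image, while a ratio of zero means none of the caption words were spotted. Thresholding such a ratio is one plausible way to build the kind of parrot-caption-oriented subsets mentioned above, although the actual curation criteria may differ.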