Although CLIP is the foundation model in numerous vision-language applications, it suffers from a severe text spotting bias. This bias causes CLIP models to `Parrot' the visual text embedded within images while disregarding the authentic visual semantics. We uncover that in the most popular image-text dataset, LAION-2B, the captions also densely parrot (spell) the text embedded in images. Our analysis shows that around \textbf{50\%} of images are embedded with visual text content, and \textbf{90\%} of their captions more or less parrot the visual text. Based on this observation, we thoroughly inspect the different released versions of CLIP models and verify that the visual text is the dominant factor in measuring the LAION-style image-text similarity for these models. To examine whether these parrot captions shape the text spotting bias, we train a series of CLIP models with LAION subsets curated by different parrot-caption-oriented criteria. We show that training with parrot captions easily shapes such bias but harms the expected vision-language representation learning in CLIP models. This suggests that it is urgent to revisit either the design of CLIP-like models or the existing image-text dataset curation pipeline built on CLIP score filtering.
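As a rough illustration of what a parrot-caption-oriented criterion could look like, the sketch below scores a caption by the fraction of spotted visual-text words it repeats. It is a minimal sketch under stated assumptions, not the criterion used in this work: the names `tokenize`, `parrot_score`, and `is_parrot_caption` are hypothetical, the `threshold` default is arbitrary, and the spotted text is assumed to come from a separate text spotter that is not shown here.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def parrot_score(spotted_text, caption):
    """Fraction of spotted visual-text words that also appear in the caption.

    `spotted_text` is assumed to be the output of some text spotter (OCR).
    Returns 0.0 when no text was spotted in the image.
    """
    spotted = set(tokenize(spotted_text))
    if not spotted:
        return 0.0
    caption_words = set(tokenize(caption))
    return len(spotted & caption_words) / len(spotted)


def is_parrot_caption(spotted_text, caption, threshold=0.0):
    """Hypothetical criterion: flag captions whose score exceeds `threshold`."""
    return parrot_score(spotted_text, caption) > threshold
```

With the default threshold, any caption that repeats at least one spotted word is flagged; raising `threshold` toward 1.0 keeps only captions that spell out nearly all of the embedded text, which is one way to build subsets of varying strictness.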