Typographic attacks exploit the interplay between text and visual content in multimodal foundation models, causing misclassifications when misleading text is embedded within images. Existing datasets are limited in size and diversity, making it difficult to study such vulnerabilities. In this paper, we introduce SCAM, the largest and most diverse dataset of real-world typographic attack images to date, containing 1,162 images across hundreds of object categories and attack words. Through extensive benchmarking of Vision-Language Models (VLMs) on SCAM, we demonstrate that typographic attacks significantly degrade performance, and we identify training data and model architecture as factors influencing susceptibility to these attacks. Our findings indicate that typographic attacks remain effective against state-of-the-art Large Vision-Language Models, especially those employing vision encoders that are inherently vulnerable to such attacks. However, employing larger Large Language Model backbones reduces this vulnerability while simultaneously enhancing typographic understanding. Additionally, we demonstrate that synthetic attacks closely resemble real-world (handwritten) attacks, validating their use in research. Our work provides a comprehensive resource and empirical insights to facilitate future research toward robust and trustworthy multimodal AI systems. Finally, we publicly release the datasets introduced in this paper, along with the code for evaluations, at www.bliss.berlin/research/scam.
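To make the attack setting concrete, the sketch below overlays a misleading word onto an image, which is the basic construction behind synthetic typographic attacks. This is a minimal illustration using Pillow, not the SCAM dataset-creation pipeline; the helper name, word placement, and stand-in gray image are assumptions for the example.

```python
from PIL import Image, ImageDraw

def add_typographic_attack(image, word, position=(10, 10)):
    """Overlay an attack word onto a copy of the image.

    Hypothetical helper for illustration only: a real typographic
    attack would place misleading text (e.g. "dog" on a photo of a
    cat) so that a VLM's prediction shifts toward the written word.
    """
    attacked = image.copy()
    draw = ImageDraw.Draw(attacked)
    # Default bitmap font; real attacks vary font, size, and placement.
    draw.text(position, word, fill="white")
    return attacked

# Stand-in for a real photo (224x224 is a common VLM input resolution).
base = Image.new("RGB", (224, 224), color="gray")
attacked = add_typographic_attack(base, "dog")
```

Evaluating a VLM on both `base` and `attacked` versions of the same photo, and measuring how often the prediction flips to the attack word, is the kind of comparison the benchmark performs at scale.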