Large Vision-Language Models (LVLMs) combine vision encoders with Large Language Models (LLMs) to achieve remarkable performance on a variety of multi-modal tasks in the joint space of vision and language. However, the Typographic Attack, which disrupts vision-language models (VLMs) such as Contrastive Language-Image Pretraining (CLIP), is also expected to pose a security threat to LVLMs. First, we verify typographic attacks against well-known commercial and open-source LVLMs and uncover the widespread existence of this threat. Second, to better assess this vulnerability, we propose the most comprehensive and largest-scale Typographic Dataset to date. The Typographic Dataset not only supports the evaluation of typographic attacks across diverse multi-modal tasks but also measures how attack effectiveness varies with the factors used to generate the overlaid text. Based on the evaluation results, we investigate why typographic attacks affect VLMs and LVLMs, arriving at three highly insightful discoveries. Through examination of these discoveries and experimental validation on the Typographic Dataset, we reduce the performance degradation of LVLMs under typographic attacks from $42.07\%$ to $13.90\%$.
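A typographic attack sample is simply an image with misleading text rendered onto it. The following is a minimal sketch of how such a sample might be constructed with Pillow; the function name, default position, and color are illustrative assumptions, not the paper's implementation.

```python
from PIL import Image, ImageDraw


def add_typographic_text(image, text, position=(10, 10), color=(255, 255, 255)):
    """Return a copy of `image` with attack `text` overlaid at `position`.

    Illustrative sketch: real attacks may vary font, size, opacity,
    and placement, which are among the text-generation factors the
    Typographic Dataset evaluates.
    """
    attacked = image.copy()  # leave the clean image untouched
    ImageDraw.Draw(attacked).text(position, text, fill=color)
    return attacked


# Example: overlay a misleading class label ("dog") on an image.
clean = Image.new("RGB", (224, 224), (0, 0, 0))
attacked = add_typographic_text(clean, "dog")
```

Such an attacked image, when fed to a CLIP-style classifier or an LVLM, can bias the model toward the rendered word rather than the actual visual content.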