Large Vision-Language Models (LVLMs) have made significant progress in responding to users' visual instructions. However, these instructions, which comprise images and text, are susceptible to both intentional and inadvertent attacks. Although the robustness of LVLMs against such threats is critically important, research in this area remains limited. To bridge this gap, we introduce AVIBench, a framework for analyzing the robustness of LVLMs against various adversarial visual-instructions (AVIs), including four types of image-based AVIs, ten types of text-based AVIs, and nine types of content bias AVIs (such as gender, violence, cultural, and racial biases, among others). We generate 260K AVIs covering five categories of multimodal capabilities (nine tasks) and content bias, and then comprehensively evaluate 14 open-source LVLMs on them. AVIBench also serves as a convenient tool for practitioners to evaluate the robustness of LVLMs against AVIs. Our findings and extensive experimental results shed light on the vulnerabilities of LVLMs and show that inherent biases exist even in advanced closed-source LVLMs such as GeminiProVision and GPT-4V. This underscores the importance of enhancing the robustness, security, and fairness of LVLMs. The source code and benchmark will be made publicly available.