While GPT-4V(ision) impressively models both visual and textual information simultaneously, its hallucination behavior has not been systematically assessed. To bridge this gap, we introduce a new benchmark, namely, the Bias and Interference Challenges in Visual Language Models (Bingo). This benchmark is designed to evaluate and shed light on the two common types of hallucinations in visual language models: bias and interference. Here, bias refers to the model's tendency to hallucinate certain types of responses, possibly due to imbalance in its training data. Interference pertains to scenarios where the judgment of GPT-4V(ision) can be disrupted by how the text prompt is phrased or how the input image is presented. We identify a notable regional bias, whereby GPT-4V(ision) is better at interpreting Western images or images with English writing than images from other countries or images containing text in other languages. Moreover, GPT-4V(ision) is vulnerable to leading questions and is often confused when interpreting multiple images together. Popular mitigation approaches, such as self-correction and chain-of-thought reasoning, are not effective in resolving these challenges. We also identify similar biases and interference vulnerabilities in LLaVA and Bard. Our results characterize the hallucination challenges in GPT-4V(ision) and state-of-the-art visual-language models, and highlight the need for new solutions. The Bingo benchmark is available at https://github.com/gzcch/Bingo.