Large Vision-Language Models (LVLMs) have demonstrated outstanding performance across various multimodal tasks. However, they suffer from a problem known as language prior, where responses are generated based solely on textual patterns while disregarding image information. Addressing language priors is crucial, as reliance on them can lead to undesirable biases or hallucinations when models process images that lie outside the training distribution. Despite its importance, methods for accurately measuring language priors in LVLMs remain poorly studied. Although existing benchmarks based on counterfactual or out-of-distribution images can be partially used to measure language priors, they fail to disentangle language priors from other confounding factors. To this end, we propose a new benchmark called VLind-Bench, the first benchmark specifically designed to measure the language priors, or blindness, of LVLMs. It not only includes tests on counterfactual images to assess language priors, but also comprises a series of tests evaluating more basic capabilities such as commonsense knowledge, visual perception, and commonsense bias. For each instance in our benchmark, we ensure that all of these basic tests are passed before evaluating language priors, thereby minimizing the influence of other factors on the assessment. Our evaluation and analysis of recent LVLMs on this benchmark reveal that almost all models exhibit a significant reliance on language priors, presenting a strong challenge to the field.