Large vision-language models (LVLMs) have recently made rapid progress, exhibiting strong perception and reasoning abilities over visual information. However, when prompts differ in the size of their solution space, LVLMs do not always give consistent answers about the same knowledge point. This inconsistency of answers across solution spaces is prevalent in LVLMs and erodes trust. To address this, we present ConBench, a multi-modal benchmark for intuitively analyzing how LVLMs perform when the solution space of a prompt revolves around a single knowledge point. Using ConBench, we are the first to chart this landscape, and we report the following findings: (1) In the discriminative realm, the larger the solution space of the prompt, the lower the accuracy of the answers. (2) We establish a relationship between the discriminative and generative realms: the accuracy of a discriminative question type exhibits a strong positive correlation with its Consistency with the caption. (3) Compared to open-source models, closed-source models exhibit a pronounced bias advantage in terms of Consistency. Finally, we improve the consistency of LVLMs through trigger-based diagnostic refinement, which indirectly improves the quality of their captions. We hope this paper will help the research community better evaluate its models and encourage future advances in the consistency domain.