Unveiling the Tapestry of Consistency in Large Vision-Language Models

Large vision-language models (LVLMs) have recently made rapid progress, exhibiting strong perception and reasoning abilities over visual information. However, when faced with prompts whose solution spaces differ in size, LVLMs do not always give consistent answers about the same knowledge point. This inconsistency of answers across solution spaces is prevalent in LVLMs and erodes trust. To this end, we provide a multi-modal benchmark, ConBench, to intuitively analyze how LVLMs perform when the solution space of a prompt revolves around a knowledge point. Based on ConBench, we are the first to reveal the tapestry and report the following findings: (1) In the discriminative realm, the larger the solution space of the prompt, the lower the accuracy of the answers. (2) We establish a relationship between the discriminative and generative realms: the accuracy of a discriminative question type exhibits a strong positive correlation with its Consistency with the caption. (3) Compared to open-source models, closed-source models exhibit a pronounced bias advantage in terms of Consistency. Finally, we improve the consistency of LVLMs by trigger-based diagnostic refinement, indirectly improving their captioning performance. We hope this paper will help the research community better evaluate their models and encourage future advancements in the consistency domain.
