Compositional reasoning is usually considered a fundamental skill that characterizes human perception. Recent studies show that current Vision Language Models (VLMs) fall surprisingly short in this respect. To this end, we thoroughly diagnose the compositional representations encoded by VLMs and systematically reveal potential causes of this weakness. Specifically, we propose evaluation methods from a novel game-theoretic view to assess the vulnerability of VLMs on different aspects of compositional understanding, e.g., relations and attributes. Extensive experimental results demonstrate and validate several insights into the shortcomings of VLMs in compositional reasoning, which provide useful and reliable guidance for future studies. The deliverables will be updated at https://vlms-compositionality-gametheory.github.io/.