With the success of Large Language Models (LLMs), many Generative Vision-Language Models (GVLMs) have been constructed via multimodal instruction tuning. However, the performance of GVLMs in multimodal compositional reasoning remains under-explored. In this paper, we examine both the evaluation metrics (VisualGPTScore, etc.) and the current benchmarks used to evaluate the compositionality of GVLMs. We identify a syntactical bias in current benchmarks, which GVLMs exploit through their linguistic capability. This bias renders VisualGPTScore an insufficient metric for assessing GVLMs. To combat this, we first introduce a SyntaxBias Score, which leverages LLMs to quantify the bias so that it can be mitigated. We then add a challenging new task to evaluate the robustness of GVLMs against their inherent inclination toward syntactical correctness. Using the bias-mitigated datasets and the new task, we propose a novel benchmark, the SyntActically DE-biased benchmark (SADE). Our study provides an unbiased benchmark for the compositionality of GVLMs, facilitating future research in this direction. Code and dataset are available at https://github.com/TeleeMa/SADE.