Large Vision-Language Models (LVLMs) excel at multimodal reasoning and have shown impressive performance on a variety of multimodal benchmarks. However, most of these benchmarks evaluate models through multiple-choice or short-answer formats, which ignore the reasoning process entirely. The few benchmarks that do assess the reasoning process rely on overly simplistic methods and examine the reasoning only when the final answer is wrong, overlooking cases where flawed reasoning nonetheless yields a correct answer. In addition, these benchmarks do not consider how intermodal relationships affect reasoning. To address these issues, we propose the Reasoning Process Tree Score (RPTS), a tree-based metric for assessing reasoning processes. Specifically, we organize the reasoning steps into a reasoning tree and exploit its hierarchical structure to assign a weighted faithfulness score to each reasoning step. By dynamically adjusting these weights, RPTS not only evaluates the overall correctness of the reasoning but also pinpoints where the model's reasoning fails. To validate RPTS in realistic multimodal scenarios, we construct a new benchmark, RPTS-Eval, comprising 374 images and 390 reasoning instances. Each instance provides reliable visual-textual clues that serve as the leaf nodes of the reasoning tree. Furthermore, we define three types of intermodal relationships to investigate how cross-modal interactions influence the reasoning process. We evaluate representative LVLMs (e.g., GPT-4o, LLaVA-NeXT), uncovering their limitations in multimodal reasoning and highlighting the gap between open-source and closed-source commercial LVLMs. We believe this benchmark will contribute to the advancement of research on multimodal reasoning.
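The idea of aggregating per-step faithfulness scores over a reasoning tree can be sketched as follows. This is a minimal illustration, not the paper's actual formulation: the `Node` structure, the geometric depth-decay weighting, and the `decay` parameter are all assumptions standing in for the dynamic weighting scheme the abstract describes.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A reasoning step with a faithfulness score in [0, 1]."""
    score: float
    children: list["Node"] = field(default_factory=list)

def rpts(root: Node, decay: float = 0.5) -> float:
    """Weighted average of step scores over a reasoning tree.

    Hypothetical weighting: a node at depth d (root = 0) gets weight
    decay**d, so steps near the final conclusion count more than the
    leaf-level clues. The paper's dynamic weight adjustment is not
    reproduced here.
    """
    weighted, total = 0.0, 0.0
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        w = decay ** depth
        weighted += w * node.score
        total += w
        stack.extend((child, depth + 1) for child in node.children)
    return weighted / total
```

For example, a fully faithful conclusion (score 1.0) supported by two leaf clues scored 0.5 and 1.0 yields (1·1.0 + 0.5·0.5 + 0.5·1.0) / 2 = 0.875, and inspecting the lowest-scoring node localizes where the reasoning fails.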