Adaptive multimodal reasoning has emerged as a promising frontier for Vision-Language Models (VLMs), aiming to dynamically switch between tool-augmented visual reasoning and text-only reasoning to improve both effectiveness and efficiency. However, existing evaluations rely on static difficulty labels and simplistic metrics, which fail to capture the dynamic nature of difficulty relative to varying model capacities. Consequently, they conflate adaptive mode selection with general performance and neglect fine-grained process analysis. In this paper, we propose AdaptMMBench, a comprehensive benchmark for adaptive multimodal reasoning spanning five domains: real-world, OCR, GUI, knowledge, and math, and covering both direct perception and complex reasoning tasks. AdaptMMBench uses the Matthews Correlation Coefficient (MCC) to evaluate the rationality of reasoning-mode selection, isolating this meta-cognitive ability by identifying task difficulty dynamically from each model's capability boundary. Moreover, AdaptMMBench supports multi-dimensional process evaluation across key-step coverage, tool effectiveness, and computational efficiency. Our evaluation reveals that while adaptive mode selection scales with model capacity, it is notably decoupled from final accuracy. Conversely, key-step coverage aligns with performance, whereas tool effectiveness remains highly inconsistent across model architectures.
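As a point of reference for the selection-rationality metric, the standard Matthews Correlation Coefficient over a binary confusion matrix is

\[
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}},
\]

where, in this setting, one plausible (assumed, not specified here) mapping treats invoking tool-augmented visual reasoning on a task beyond the model's text-only capability boundary as a true positive, and falling back to text reasoning on a task within that boundary as a true negative. The exact labeling protocol used by AdaptMMBench is defined by the benchmark itself; this formula is only a reminder of the metric's form.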