Multimodal large language models (MLLMs) that integrate visual and textual reasoning leverage chain-of-thought (CoT) prompting to tackle complex visual tasks, yet they continue to exhibit visual hallucinations and an over-reliance on textual priors. We present a systematic diagnosis of state-of-the-art vision-language models using a three-stage evaluation framework, uncovering key failure modes. To address these, we propose an agent-based architecture that combines LLM reasoning with lightweight visual modules, enabling fine-grained analysis and iterative refinement of reasoning chains. Our results indicate that future visual reasoning models should integrate a broader set of specialized tools for analyzing visual content. Our system achieves significant gains over a 7B baseline (+10.3 on MMMU, +6.0 on MathVista), matching or surpassing much larger models. We will release our framework and evaluation suite to facilitate future research.