AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning

When humans face problems beyond their immediate capabilities, they rely on tools; this offers a promising paradigm for improving visual reasoning in multimodal large language models (MLLMs). Effective reasoning therefore hinges on knowing which tools to use, when to invoke them, and how to compose them over multiple steps, even when faced with new tools or new tasks. We introduce \textbf{AdaReasoner}, a family of multimodal models that learn tool use as a general reasoning skill rather than as tool-specific or explicitly supervised behavior. AdaReasoner is enabled by (i) a scalable data curation pipeline that exposes models to long-horizon, multi-step tool interactions; (ii) Tool-GRPO, a reinforcement learning algorithm that optimizes tool selection and sequencing based on end-task success; and (iii) an adaptive learning mechanism that dynamically regulates tool usage. Together, these components allow models to infer tool utility from task context and intermediate outcomes, enabling them to coordinate multiple tools and generalize to unseen ones. Empirically, AdaReasoner exhibits strong tool-adaptive and generalization behaviors: it autonomously adopts beneficial tools, suppresses irrelevant ones, and adjusts tool usage frequency based on task demands, despite never being explicitly trained to do so. These capabilities translate into state-of-the-art performance across challenging benchmarks, improving the 7B base model by +24.9\% on average and surpassing strong proprietary systems such as GPT-5 on multiple tasks, including VSP and Jigsaw.
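The abstract does not spell out the Tool-GRPO objective or its reward design. As a rough illustration only, the sketch below shows the group-relative advantage step used in standard GRPO, which the name suggests Tool-GRPO builds on: several trajectories are sampled for the same question, and each trajectory's end-task reward is normalized against its own group. The function name, the binary rewards, and the use of the population standard deviation are assumptions made for this example, not details from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group.

    Standard GRPO-style step: subtract the group mean and divide by the
    group standard deviation. `eps` avoids division by zero when every
    rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical end-task rewards for four tool-use trajectories that were
# sampled for the same question (1.0 = correct final answer, 0.0 = wrong).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group in which all trajectories fail carries no learning signal.
flat_advantages = group_relative_advantages([0.0, 0.0, 0.0])
```

Under this scheme, trajectories whose tool choices led to a correct answer receive positive advantages and the others negative ones, so no per-step supervision of tool calls is needed; how Tool-GRPO shapes rewards beyond end-task success is not stated here.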
