AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning

Humans rely on tools when problems exceed their immediate capabilities, which suggests a promising paradigm for improving visual reasoning in multimodal large language models (MLLMs). Effective tool-augmented reasoning therefore hinges on knowing which tools to use, when to invoke them, and how to compose them over multiple steps, even when the tools or tasks are new. We introduce \textbf{AdaReasoner}, a family of multimodal models that learn tool use as a general reasoning skill rather than as tool-specific or explicitly supervised behavior. AdaReasoner is enabled by (i) a scalable data curation pipeline that exposes models to long-horizon, multi-step tool interactions; (ii) Tool-GRPO, a reinforcement learning algorithm that optimizes tool selection and sequencing based on end-task success; and (iii) an adaptive learning mechanism that dynamically regulates tool usage. Together, these components allow models to infer tool utility from task context and intermediate outcomes, which enables them to coordinate multiple tools and generalize to unseen ones. Empirically, AdaReasoner exhibits strong tool-adaptive and generalization behaviors: it autonomously adopts beneficial tools, suppresses irrelevant ones, and adjusts how often it calls tools based on task demands, despite never being explicitly trained to do so. These capabilities translate into state-of-the-art performance across challenging benchmarks, improving the 7B base model by +24.9\% on average and surpassing strong proprietary systems such as GPT-5 on multiple tasks, including VSP and Jigsaw.
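To make the idea of optimizing tool use from end-task success more concrete, the following is a minimal sketch of the group-relative advantage computation used in standard GRPO-style training. It is an illustration under stated assumptions, not the Tool-GRPO algorithm itself, whose details the abstract does not specify. The function name, the binary correctness reward, and the tool names (`crop`, `ocr`, `detect`) are all hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group.

    This is the generic GRPO-style normalization (reward minus group mean,
    divided by group standard deviation); it is an assumption that
    Tool-GRPO builds on something similar.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of rollouts sampled for one visual question.
# Each rollout records the tool sequence it chose and whether the
# final answer was correct (the end-task success signal).
rollouts = [
    {"tools": ["crop", "ocr"], "correct": True},
    {"tools": [], "correct": False},
    {"tools": ["detect"], "correct": False},
    {"tools": ["crop", "detect", "ocr"], "correct": True},
]

# Assumed reward: 1.0 for a correct final answer, 0.0 otherwise.
rewards = [1.0 if r["correct"] else 0.0 for r in rollouts]
advantages = group_relative_advantages(rewards)

for rollout, adv in zip(rollouts, advantages):
    print(rollout["tools"], round(adv, 3))
```

In this sketch, tool sequences that lead to a correct answer receive a positive advantage and the others a negative one, so no per-tool supervision is needed; when every rollout in a group earns the same reward, all advantages are zero and the group contributes no learning signal.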
