Large language models split into two families: reasoning-centric LLMs, which strengthen internal chain-of-thought reasoning but cannot invoke external tools, and agentic LLMs, which learn to interact with environments and leverage tools but often lag in deep reasoning. This divide arises from fundamentally different training objectives, leading to mismatched strengths and inefficiency on simple queries, where both families tend to overthink or over-call tools. In this work, we present the Adaptive Agent Foundation Model (A$^2$FM), a unified framework that follows a route-then-align principle: the model first learns task-aware routing and then aligns mode-specific trajectories under a shared backbone. To address the inefficiency gap, we introduce a third mode, instant, which handles simple queries directly, preventing unnecessary reasoning or tool calls while complementing the agentic and reasoning modes. To jointly enhance accuracy and efficiency, we propose Adaptive Policy Optimization (APO), which enforces adaptive sampling across modes and applies a cost-regularized reward. At the 32B scale, A$^2$FM achieves 13.4% on BrowseComp, 70.4% on AIME25, and 16.7% on HLE, setting a new SOTA among comparable models and performing competitively with frontier LLMs across agentic, reasoning, and general benchmarks. Notably, adaptive execution achieves a cost-of-pass of only $0.00487 per correct answer, cutting cost by 45.2% relative to the reasoning mode and by 33.5% relative to the agentic mode, thus delivering substantially higher cost efficiency while maintaining comparable accuracy.
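The abstract mentions a cost-regularized reward in APO but does not state its form; as a rough illustrative sketch only (the symbols $r$, $\tau$, $c(\tau)$, and $\lambda$ below are assumptions, not the paper's notation), such an objective commonly combines a correctness term with a scaled cost penalty:

$$
r(\tau) \;=\; \mathbb{1}\!\left[\mathrm{correct}(\tau)\right] \;-\; \lambda\, c(\tau),
$$

where $c(\tau)$ would aggregate the tokens generated and tool calls issued along trajectory $\tau$, and $\lambda \ge 0$ trades accuracy against execution cost, with larger $\lambda$ pushing the router toward the instant mode on simple queries.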