Large language models (LLMs) are increasingly deployed as autonomous agents for multi-turn decision-making tasks. However, current agents typically rely on fixed cognitive patterns: non-thinking models generate immediate responses, while thinking models engage in deep reasoning uniformly. This rigidity is inefficient for long-horizon tasks, where cognitive demands vary significantly from step to step, with some requiring strategic planning and others only routine execution. In this paper, we introduce CogRouter, a framework that trains agents to dynamically adapt cognitive depth at each step. Grounded in ACT-R theory, we design four hierarchical cognitive levels ranging from instinctive responses to strategic planning. Our two-stage training approach includes Cognition-aware Supervised Fine-tuning (CoSFT) to instill stable level-specific patterns, and Cognition-aware Policy Optimization (CoPO) for step-level credit assignment via confidence-aware advantage reweighting. The key insight is that appropriate cognitive depth should maximize the confidence of the resulting action. Experiments on ALFWorld and ScienceWorld demonstrate that CogRouter achieves state-of-the-art performance with superior efficiency. With Qwen2.5-7B, it reaches an 82.3% success rate, outperforming GPT-4o (+40.3%), OpenAI-o3 (+18.3%), and GRPO (+14.0%), while using 62% fewer tokens.
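The core mechanism of CoPO, confidence-aware advantage reweighting, can be illustrated with a minimal sketch. Here we assume action confidence is measured as the mean token probability of the emitted action, and that per-step advantages are rescaled toward steps whose chosen cognitive depth produced confident actions; the function names, the confidence proxy, and the `alpha` scaling parameter are all hypothetical illustrations, not the paper's exact formulation.

```python
import math

def action_confidence(token_logprobs):
    """Confidence proxy: mean per-token probability of the emitted
    action (hypothetical; the paper's exact measure may differ)."""
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)

def reweight_advantages(advantages, confidences, alpha=1.0):
    """Rescale each step-level advantage by how its action confidence
    compares to the trajectory mean, so steps where the selected
    cognitive depth yielded a confident action receive more credit."""
    mean_c = sum(confidences) / len(confidences)
    return [a * (1.0 + alpha * (c - mean_c))
            for a, c in zip(advantages, confidences)]

# Example: two steps with equal raw advantage; the more confident
# step is up-weighted, the less confident one down-weighted.
weighted = reweight_advantages([1.0, 1.0], [0.9, 0.5], alpha=1.0)
```

In this toy run the step with confidence 0.9 (above the 0.7 mean) gets its advantage scaled up to 1.2, while the 0.5-confidence step is scaled down to 0.8, sketching how credit flows toward well-calibrated cognitive-depth choices.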