Agentic Critical Training

Training large language models (LLMs) as autonomous agents often begins with imitation learning, which teaches agents what to do but not why: agents never contrast successful actions with suboptimal alternatives, so they lack awareness of action quality. Recent approaches address this by adding self-reflection supervision derived from contrasts between expert and alternative actions. The training paradigm nevertheless remains imitation learning: the model imitates pre-constructed reflection text rather than learning to reason autonomously. We propose Agentic Critical Training (ACT), a reinforcement learning paradigm that trains agents to identify the better action among alternatives. Because the reward depends only on whether the model's judgment is correct, ACT drives the model to develop its own reasoning about action quality, producing genuine self-reflection rather than an imitation of it. Across three challenging agent benchmarks, ACT consistently improves agent performance when combined with different post-training methods, with an average improvement of 5.07 points over imitation learning and 4.62 points over reinforcement learning. Compared with approaches that inject reflection capability through knowledge distillation, ACT yields an average improvement of 2.42 points. ACT also enables strong out-of-distribution generalization on agentic benchmarks and improves performance on general reasoning benchmarks without any reasoning-specific training data. These results suggest that ACT is a promising path toward more reflective and capable LLM agents.
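As a rough illustration of rewarding "whether the model's judgment is correct", the sketch below builds a two-way comparison between an expert action and an alternative and scores a model response with a binary reward. This is a minimal sketch under stated assumptions, not the implementation used in this work: the prompt wording, the `Answer: A/B` output format, the two-candidate setup, and the names `CriticExample`, `build_example`, and `judgment_reward` are all hypothetical and not taken from the abstract.

```python
import random
import re
from dataclasses import dataclass


@dataclass
class CriticExample:
    prompt: str         # text shown to the model
    correct_label: str  # "A" or "B": the slot holding the expert action


def build_example(observation, expert_action, alternative_action, rng):
    """Place the expert action in a random slot so position carries no signal.

    Assumption: the expert action is treated as the better one of the pair.
    """
    expert_first = rng.random() < 0.5
    if expert_first:
        action_a, action_b = expert_action, alternative_action
    else:
        action_a, action_b = alternative_action, expert_action
    prompt = (
        f"Observation: {observation}\n"
        f"Action A: {action_a}\n"
        f"Action B: {action_b}\n"
        "Reason about which action is better, then finish with "
        "'Answer: A' or 'Answer: B'."
    )
    return CriticExample(prompt=prompt,
                         correct_label="A" if expert_first else "B")


# Hypothetical output format: the response ends with "Answer: A" or "Answer: B".
ANSWER_RE = re.compile(r"Answer:\s*([AB])\b")


def judgment_reward(response, example):
    """Return 1.0 if the last stated choice matches the expert slot, else 0.0.

    Only the final judgment is scored; the reasoning text is left unconstrained.
    """
    choices = ANSWER_RE.findall(response)
    if not choices:
        return 0.0  # no parseable judgment
    return 1.0 if choices[-1] == example.correct_label else 0.0
```

In this sketch the reward never inspects the reasoning itself, which mirrors the idea that the model is rewarded for a correct judgment rather than for reproducing reference reflection text. A reinforcement learning trainer would consume `judgment_reward` as the scalar reward for each sampled response.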