Agents capable of accomplishing complex tasks through multiple interactions with the environment have emerged as a popular research direction. However, in such multi-step settings, the conventional group-level policy optimization algorithm becomes suboptimal because it implicitly assumes that every action contributes equally to the final outcome, an assumption that deviates sharply from reality. Our analysis reveals that only a small fraction of actions are critical in determining the final outcome. Building on this insight, we propose CARL, a critical-action-focused reinforcement learning algorithm tailored for long-horizon agentic reasoning. CARL uses entropy as a heuristic proxy for action criticality and achieves focused training by assigning rewards to high-criticality actions while excluding low-criticality actions from model updates, thereby avoiding noisy credit assignment and redundant computation. Extensive experiments demonstrate that CARL achieves both stronger performance and higher efficiency across diverse evaluation settings. The source code will be publicly available.
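The core mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function names, the top-fraction thresholding rule, and the uniform placeholder advantages are all assumptions introduced for illustration.

```python
# Sketch of entropy-based critical-action filtering: compute per-action
# policy entropy, mark the highest-entropy actions as critical, and mask
# out low-criticality actions from the policy-gradient update.
# All names and the thresholding rule are illustrative assumptions.
import numpy as np


def action_entropy(probs: np.ndarray) -> np.ndarray:
    """Shannon entropy of each action's policy distribution.

    probs: (num_actions, num_choices) array; each row is the policy's
    distribution at one decision step of the trajectory.
    """
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)


def critical_action_mask(probs: np.ndarray, top_frac: float = 0.2) -> np.ndarray:
    """Mark the top_frac highest-entropy actions as critical; only these
    receive reward signal and contribute to the model update."""
    ent = action_entropy(probs)
    k = max(1, int(np.ceil(top_frac * len(ent))))
    thresh = np.partition(ent, -k)[-k]
    return ent >= thresh


# Toy trajectory: 5 actions over a 3-way action space.
probs = np.array([
    [0.98, 0.01, 0.01],  # low entropy  -> routine action
    [0.34, 0.33, 0.33],  # high entropy -> critical decision point
    [0.90, 0.05, 0.05],
    [0.50, 0.30, 0.20],
    [0.97, 0.02, 0.01],
])
mask = critical_action_mask(probs, top_frac=0.4)
advantages = np.ones(5)          # placeholder group-relative advantages
masked_adv = advantages * mask   # low-criticality actions excluded
```

Here only the two highest-entropy actions (the near-uniform decision points) survive the mask, so the update concentrates its credit assignment on them while the confident, routine actions are skipped.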