Verified Critical Step Optimization for LLM Agents

As large language model agents tackle increasingly complex long-horizon tasks, effective post-training becomes critical. Prior work faces three fundamental challenges: outcome-only rewards cannot precisely attribute credit to intermediate steps, estimated step-level rewards introduce systematic noise, and Monte Carlo sampling for step reward estimation incurs prohibitive computational cost. Inspired by findings that only a small fraction of high-entropy tokens drive effective RL for reasoning, we propose Critical Step Optimization (CSO), which focuses preference learning on verified critical steps: decision points where an alternative action demonstrably flips the task outcome from failure to success. Crucially, our method starts from failed policy trajectories rather than expert demonstrations, directly targeting the policy model's weaknesses. We use a process reward model (PRM) to identify candidate critical steps, leverage expert models to propose high-quality alternatives, and then continue execution from each alternative with the policy model itself until task completion. Only alternatives that the policy successfully executes to a correct outcome are verified and used as DPO training data, ensuring both quality and policy reachability. This yields fine-grained, verifiable supervision at critical decisions while avoiding both trajectory-level coarseness and step-level noise. Experiments on GAIA-Text-103 and XBench-DeepSearch show that CSO achieves 37% and 26% relative improvement, respectively, over the SFT baseline and substantially outperforms other post-training methods, while requiring supervision at only 16% of trajectory steps. These results demonstrate the effectiveness of selective, verification-based learning for agent post-training.
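The data-construction loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the three callables (`score_step` for the PRM, `propose_alternatives` for the expert model, `rollout_succeeds` for the policy rollout plus outcome check), the `prm_threshold` cutoff, the rule that a low PRM score marks a candidate critical step, and the choice to keep at most one pair per step are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class PreferencePair:
    context: List[str]  # steps taken before the critical step
    chosen: str         # verified alternative action
    rejected: str       # original policy action at that step


def build_cso_pairs(
    failed_trajectory: Sequence[str],
    score_step: Callable[[Sequence[str], str], float],           # hypothetical PRM
    propose_alternatives: Callable[[Sequence[str]], List[str]],  # hypothetical expert model
    rollout_succeeds: Callable[[Sequence[str], str], bool],      # hypothetical policy rollout + outcome check
    prm_threshold: float = 0.5,                                  # assumed cutoff, not from the paper
) -> List[PreferencePair]:
    """Collect verified preference pairs from one failed policy trajectory."""
    pairs: List[PreferencePair] = []
    for t, action in enumerate(failed_trajectory):
        context = list(failed_trajectory[:t])
        # Assumption: a low PRM score flags a candidate critical step.
        if score_step(context, action) >= prm_threshold:
            continue
        for alternative in propose_alternatives(context):
            if alternative == action:
                continue
            # Verification: the policy itself must finish the task
            # correctly after taking the alternative action.
            if rollout_succeeds(context, alternative):
                pairs.append(PreferencePair(context, alternative, action))
                break  # assumption: keep at most one pair per step
    return pairs
```

Each returned pair has the context/chosen/rejected shape that DPO-style preference training expects. Alternatives whose rollouts still fail are discarded, which is what restricts the supervision to verified critical steps.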
