Language model agents are increasingly effective at solving realistic tasks through multi-turn tool use, yet training reliable tool-using agents remains challenging in practice. Reinforcement learning provides an on-policy paradigm for improving agents from their own environment interactions, but its effectiveness depends heavily on the training task distribution. When tasks are fixed before training, the task distribution can become increasingly mismatched with the policy's evolving capabilities, so that many rollouts are spent on uninformative tasks. We propose SENTINEL, a failure-driven reinforcement learning framework that turns the Solver's rollout failures into targeted training tasks. SENTINEL follows a Controller--Proposer--Solver loop: the Controller analyzes failed trajectories and summarizes recurring error patterns, the Proposer generates executable tasks that stress these weaknesses, and the Solver is trained on the targeted tasks. On Tau2-Bench Retail with Qwen3-4B-Thinking-2507, SENTINEL improves Pass\^{}1 from 66.4 to 74.9 and outperforms RL trained on general synthetic tasks across Pass\^{}k metrics. These results demonstrate that model failures provide an effective and scalable source of targeted training signal for improving tool-using language model agents.
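To make the Controller--Proposer--Solver loop concrete, the following is a minimal toy sketch of its control flow only, not the SENTINEL implementation. Every name here (`Task`, `Rollout`, `ToySolver`, `controller`, `proposer`, `failure_driven_loop`) is hypothetical. The abstract does not say how failures are summarized, how tasks are generated, or which RL update is used, so the sketch assumes that failures carry a single error tag, that the Controller reports the most frequent tags, and that "training" is a stub that increments a per-skill counter.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Task:
    skill: str  # hypothetical label for the capability a task stresses


@dataclass(frozen=True)
class Rollout:
    task: Task
    success: bool
    error_tag: Optional[str]  # hypothetical failure annotation; None on success


def controller(rollouts, top_k=2):
    """Summarize recurring error patterns as the top_k most frequent failure tags."""
    counts = Counter(r.error_tag for r in rollouts if not r.success)
    return [tag for tag, _ in counts.most_common(top_k)]


def proposer(patterns, per_pattern=3):
    """Emit per_pattern tasks for each reported weakness (stub for task generation)."""
    return [Task(skill=p) for p in patterns for _ in range(per_pattern)]


class ToySolver:
    """Stand-in for the policy: succeeds on a skill once its level reaches a threshold."""

    def __init__(self, levels, threshold=3):
        self.levels = dict(levels)
        self.threshold = threshold

    def rollout(self, task):
        ok = self.levels.get(task.skill, 0) >= self.threshold
        return Rollout(task, ok, None if ok else task.skill)

    def train(self, tasks):
        # Placeholder for an RL update: each targeted task raises the skill level by one.
        for t in tasks:
            self.levels[t.skill] = self.levels.get(t.skill, 0) + 1


def failure_driven_loop(solver, eval_tasks, rounds=3):
    """Alternate rollouts, failure analysis, task proposal, and training."""
    history = []
    for _ in range(rounds):
        rollouts = [solver.rollout(t) for t in eval_tasks]
        history.append(sum(r.success for r in rollouts) / len(rollouts))
        patterns = controller(rollouts)
        if not patterns:  # no failures left to target
            break
        solver.train(proposer(patterns))
    return history


# Usage with made-up skills and starting levels.
solver = ToySolver({"refund": 0, "exchange": 3, "address": 1})
eval_tasks = [Task("refund"), Task("exchange"), Task("address"), Task("refund")]
history = failure_driven_loop(solver, eval_tasks)
print(history)
```

The point of the sketch is that the training set is recomputed each round from the Solver's current failures instead of being fixed before training, which is the mismatch the abstract argues against. In a real system each of the three stubs would be far richer, for example a language model producing executable, verifiable tasks in the Proposer role.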