Agentic Reinforcement Learning (RL) has empowered Large Language Models (LLMs) to utilize tools such as Python interpreters for complex problem-solving. However, for parameter-constrained models (e.g., 4B--7B), the exploration phase is often plagued by frequent execution failures, creating noisy trajectories that hinder policy optimization. Under standard outcome-based reward settings, this noise leads to a critical credit assignment issue, where erroneous actions are inadvertently reinforced alongside successful outcomes. Existing mitigations face a dilemma: dense rewards often trigger reward hacking, while supersampling incurs prohibitive computational costs. To address these challenges, we propose CLEANER. Distinct from external filtering methods, CLEANER exploits the model's intrinsic self-correction capabilities to eliminate error-contaminated context directly during data collection. At its core, the Similarity-Aware Adaptive Rollback (SAAR) mechanism autonomously constructs clean, purified trajectories by retrospectively replacing failures with successful self-corrections. Based on semantic similarity, SAAR adaptively regulates replacement granularity, ranging from shallow execution repairs to deep reasoning substitutions. By training on these self-purified paths, the model internalizes correct reasoning patterns rather than error-recovery loops. Empirical results on AIME24/25, GPQA, and LiveCodeBench show average accuracy gains of 6%, 3%, and 5% over baselines. Notably, CLEANER matches state-of-the-art performance using only one-third of the training steps, highlighting trajectory purification as a scalable solution for efficient agentic RL. Our models and code are available on GitHub.
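To make the rollback idea concrete, here is a minimal sketch of a similarity-gated replacement step in the spirit of SAAR. All names, the threshold value, and the use of `difflib` as a stand-in for a semantic similarity model are illustrative assumptions, not the paper's actual implementation: high similarity between the failed segment and its successful retry triggers a shallow in-place repair, while low similarity triggers a deep substitution that discards the contaminated suffix.

```python
# Hypothetical sketch of Similarity-Aware Adaptive Rollback (SAAR).
# SequenceMatcher is a cheap stand-in for a semantic similarity model
# (e.g., cosine similarity over sentence embeddings).
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Placeholder similarity score in [0, 1]; assumed, not the paper's metric."""
    return SequenceMatcher(None, a, b).ratio()


def saar_rollback(trajectory: list[str], retry_segment: str,
                  failed_index: int, threshold: float = 0.6) -> list[str]:
    """Purify a trajectory by replacing a failed step with its self-correction.

    High similarity -> shallow execution repair: swap only the failed step,
    keeping the surrounding reasoning intact.
    Low similarity  -> deep reasoning substitution: roll back to the failed
    step and drop everything after it, since the corrected reasoning diverges.
    """
    failed_segment = trajectory[failed_index]
    if similarity(failed_segment, retry_segment) >= threshold:
        # Shallow repair: the retry is a minor fix (e.g., a syntax error).
        return trajectory[:failed_index] + [retry_segment] + trajectory[failed_index + 1:]
    # Deep substitution: the retry took a different reasoning path.
    return trajectory[:failed_index] + [retry_segment]
```

A usage sketch: a near-identical retry (a one-character code fix) keeps the rest of the trajectory, while a dissimilar retry truncates it, so later steps conditioned on the wrong path never enter the training data.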