Reinforcement learning (RL) has become a central paradigm for post-training large language models. Existing critic-free RL methods typically generate a group of rollouts for the same question to estimate value baselines for advantage computation. However, this design suffers from data inefficiency, group synchronization barriers, and inflexibility with structured rollouts. In this work, we revisit the role of the "group" and show that its underlying function is not merely to estimate baselines but to prevent false penalties on negative samples. Building on this insight, we propose negative token filtering, a simple and effective strategy that enables stable single-rollout training. We apply it to two batch-level advantage methods, achieving comparable performance on reasoning tasks and stronger performance on agentic tasks relative to group-based RL techniques.
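To make the contrast between group-level and batch-level baselines concrete, the following is a minimal sketch, not the method proposed in this work. It assumes a plain mean-reward baseline without variance normalization, and the function names `group_advantages` and `batch_advantages` are hypothetical. Negative token filtering itself is not shown, since the abstract does not specify its rule.

```python
from statistics import mean


def group_advantages(rewards_by_question):
    """Group baseline: several rollouts per question.

    Each rollout's advantage is its reward minus the mean reward of the
    rollouts sampled for the same question (assumed baseline form).
    """
    advantages = {}
    for question, rewards in rewards_by_question.items():
        baseline = mean(rewards)
        advantages[question] = [r - baseline for r in rewards]
    return advantages


def batch_advantages(rewards):
    """Batch baseline: one rollout per question.

    Each rollout's advantage is its reward minus the mean reward over the
    whole batch, so no per-question group is required.
    """
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# Hypothetical binary rewards, for illustration only.
grouped = group_advantages({"q1": [1.0, 0.0, 1.0, 0.0], "q2": [0.0, 0.0, 0.0, 1.0]})
single = batch_advantages([1.0, 0.0, 0.0, 1.0])
```

In the grouped case every question needs all of its rollouts to finish before any advantage can be computed, which is the synchronization barrier mentioned above. In the batch case each question contributes a single rollout, and a failed rollout on a hard question receives the same negative advantage as a failed rollout on an easy one.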