Training reinforcement learning (RL) systems in real-world environments remains challenging due to noisy supervision and poor out-of-domain (OOD) generalization, especially in LLM post-training. Recent distributional RL methods improve robustness by modeling values with multiple quantile points, but they still learn each quantile independently as a scalar. This yields coarse-grained value representations that lack fine-grained conditioning on state information and struggle under complex and OOD conditions. We propose DFPO (Distributional Value Flow Policy Optimization with Conditional Risk and Consistency Control), a robust distributional RL framework that models values as continuous flows across time steps. By learning a value flow field rather than isolated quantile predictions, DFPO scales value modeling and captures richer state information for more accurate advantage estimation. To stabilize training under noisy feedback, DFPO further integrates conditional risk control and consistency constraints along value flow trajectories. Experiments on dialogue, math reasoning, and scientific tasks show that DFPO outperforms PPO, FlowRL, and other robust baselines under noisy supervision, achieving improved training stability and generalization.
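To make the contrast concrete, below is a minimal, illustrative sketch (assuming PyTorch; class and method names are hypothetical, not from the paper) of the standard quantile-based distributional value head the abstract criticizes: each quantile is predicted as an independent scalar and fit with the pinball (quantile regression) loss, with no coupling across quantiles or time steps.

```python
# Illustrative sketch only: a quantile-based distributional value head of the
# kind the abstract contrasts with DFPO's value flow. Each quantile is an
# independent scalar output trained with the standard pinball loss.
import torch
import torch.nn as nn


class QuantileValueHead(nn.Module):
    def __init__(self, hidden_dim: int, num_quantiles: int = 32):
        super().__init__()
        # Fixed quantile fractions tau_i in (0, 1), one per output scalar.
        self.register_buffer(
            "taus", (torch.arange(num_quantiles) + 0.5) / num_quantiles
        )
        self.proj = nn.Linear(hidden_dim, num_quantiles)

    def forward(self, state_repr: torch.Tensor) -> torch.Tensor:
        # One scalar value estimate per quantile: shape (batch, num_quantiles).
        return self.proj(state_repr)

    def quantile_loss(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # Pinball loss: each quantile is fit independently against the target return.
        # pred: (batch, num_quantiles); target: (batch,) scalar returns.
        diff = target.unsqueeze(1) - pred
        weight = torch.abs(self.taus.unsqueeze(0) - (diff.detach() < 0).float())
        return (weight * torch.abs(diff)).mean()
```

Per the abstract, DFPO replaces such per-quantile scalar predictions with a continuous value flow field across time steps, augmented by conditional risk control and consistency constraints; the sketch above only illustrates the baseline representation being improved upon, not DFPO itself.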