Training expressive flow-based policies with off-policy reinforcement learning is notoriously unstable due to gradient pathologies in the multi-step action sampling process. We trace this instability to a fundamental connection: the flow rollout is algebraically equivalent to a residual recurrent computation, making it susceptible to the same vanishing and exploding gradients as RNNs. To address this, we reparameterize the velocity network using principles from modern sequential models, introducing two stable architectures: Flow-G, which incorporates a gated velocity, and Flow-T, which utilizes a decoded velocity. We then develop a practical SAC-based algorithm, enabled by a noise-augmented rollout, that facilitates direct end-to-end training of these policies. Our approach supports both from-scratch and offline-to-online learning and achieves state-of-the-art performance on continuous control and robotic manipulation benchmarks, eliminating the need for common workarounds like policy distillation or surrogate objectives.
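To make the claimed equivalence concrete, below is a minimal sketch of an Euler-discretized flow-policy rollout in PyTorch. All class and function names here are illustrative assumptions, not the paper's implementation; the point is only that each sampling step has the form of a residual recurrence, so unrolling the flow behaves like an unrolled residual RNN, and a gate on the velocity (in the spirit of the gated variant) bounds the per-step update much as gating does in GRUs/LSTMs.

```python
import torch
import torch.nn as nn

class VelocityMLP(nn.Module):
    """Plain velocity field v_theta(a_k, s, t_k); names are illustrative."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, a, s, t):
        return self.net(torch.cat([a, s, t], dim=-1))

def flow_rollout(velocity, s, a0, num_steps=10):
    """Euler-discretized flow sampling.

    Each step is a residual update a_{k+1} = a_k + dt * v(a_k, s, t_k),
    i.e. the same recurrence structure as a residual RNN unrolled over
    num_steps, so gradients through the rollout can vanish or explode.
    """
    dt = 1.0 / num_steps
    a = a0
    for k in range(num_steps):
        t = torch.full_like(a[..., :1], k * dt)
        a = a + dt * velocity(a, s, t)  # residual recurrence
    return a

class GatedVelocity(nn.Module):
    """Hypothetical gated velocity (not the paper's exact Flow-G design):
    a sigmoid gate interpolates between the carried action and a bounded
    candidate, keeping each step's Jacobian well conditioned."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        in_dim = state_dim + action_dim + 1
        self.gate = nn.Sequential(nn.Linear(in_dim, hidden), nn.SiLU(),
                                  nn.Linear(hidden, action_dim), nn.Sigmoid())
        self.cand = nn.Sequential(nn.Linear(in_dim, hidden), nn.SiLU(),
                                  nn.Linear(hidden, action_dim), nn.Tanh())

    def forward(self, a, s, t):
        x = torch.cat([a, s, t], dim=-1)
        g, c = self.gate(x), self.cand(x)
        # With v = g * (c - a), the Euler step becomes
        # a + dt * v = (1 - g*dt) * a + g*dt * c, a convex-style gated update.
        return g * (c - a)
```

As a usage note under the same assumptions, `flow_rollout(VelocityMLP(s_dim, a_dim), s, torch.randn(batch, a_dim))` draws actions from the plain flow policy, and swapping in `GatedVelocity` reuses the identical rollout while changing only the velocity parameterization.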