Diffusion and flow-based generative policies provide a powerful policy class for reinforcement learning by inducing rich stochastic exploration through iterative action generation. However, the stochasticity of diffusion policies is ill-suited to stable and precise control in high-dimensional robotic systems, where small action variations can accumulate into inconsistent motion and reduced robustness. To address this issue, we propose SteerGenPO, a latent-space reinforcement learning framework that steers a trained generative policy into a robust deterministic robotic controller. The key idea is to replace the stochastic latent sampling of the trained generative policy with a learned latent actor that predicts a state-dependent latent input for that policy. This separates exploration from control: stochastic generative sampling provides diverse action proposals during policy learning, while deterministic latent steering provides stable and adaptive control at deployment. We evaluate SteerGenPO on six Isaac Lab benchmarks and a Unitree G1 locomotion task. The results show that SteerGenPO improves over both classical RL and generative RL baselines, and that its deterministic latent steering produces more stable inference-time behavior and more reliable command responses.
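The separation between stochastic sampling and deterministic latent steering can be illustrated with a minimal sketch. It is not the SteerGenPO implementation: the toy velocity field, the linear latent actor, the choice of a latent with the same dimension as the action, and all names (`velocity`, `generate_action`, `latent_actor`, `act_stochastic`, `act_deterministic`) are assumptions made for illustration, and training of the latent actor is omitted entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM, N_STEPS = 4, 2, 8  # illustrative sizes, not from the paper

# Hypothetical frozen generative policy: a toy state-conditioned velocity field.
W_a = rng.normal(scale=0.3, size=(ACTION_DIM, ACTION_DIM))
W_s = rng.normal(scale=0.3, size=(ACTION_DIM, STATE_DIM))


def velocity(a, s, t):
    """Toy velocity field v(a, s, t) standing in for a trained flow policy."""
    return np.tanh(W_a @ a + W_s @ s) * (1.0 - t)


def generate_action(z, s):
    """Euler-integrate from latent z to an action; deterministic given (z, s)."""
    a = z.copy()
    dt = 1.0 / N_STEPS
    for k in range(N_STEPS):
        a = a + dt * velocity(a, s, k * dt)
    return a


# Hypothetical latent actor: a fixed linear map with tanh squashing.
# In practice its weights would be learned with RL; here they are random.
W_z = rng.normal(scale=0.3, size=(ACTION_DIM, STATE_DIM))


def latent_actor(s):
    """Predict a state-dependent latent input for the generative policy."""
    return np.tanh(W_z @ s)


def act_stochastic(s, sample_rng):
    """Exploration mode: draw the latent from a standard Gaussian prior."""
    z = sample_rng.standard_normal(ACTION_DIM)
    return generate_action(z, s)


def act_deterministic(s):
    """Deployment mode: the latent actor replaces the random latent draw."""
    return generate_action(latent_actor(s), s)


state = np.ones(STATE_DIM)
sample_rng = np.random.default_rng(1)
stochastic_actions = [act_stochastic(state, sample_rng) for _ in range(2)]
deterministic_actions = [act_deterministic(state) for _ in range(2)]
```

For a fixed state, the two entries of `deterministic_actions` coincide, while the two entries of `stochastic_actions` differ because each call draws a fresh latent. Under these assumptions, the only change between exploration and deployment is where the latent comes from; the generative policy itself is left untouched.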