Recent advances in reinforcement learning (RL) have demonstrated the powerful exploration capabilities and multimodality of generative diffusion-based policies. While substantial progress has been made in offline RL and off-policy RL settings, integrating diffusion policies into on-policy frameworks like PPO remains underexplored. This gap is particularly significant given the widespread use of large-scale parallel GPU-accelerated simulators, such as IsaacLab, which are optimized for on-policy RL algorithms and enable rapid training of complex robotic tasks. A key challenge lies in computing state-action log-likelihoods under diffusion policies: this is straightforward for Gaussian policies but intractable for flow-based models due to irreversible forward-reverse processes and discretization errors (e.g., Euler-Maruyama approximations). To bridge this gap, we propose GenPO, a generative policy optimization framework that leverages exact diffusion inversion to construct invertible action mappings. GenPO introduces a novel doubled dummy action mechanism that achieves invertibility via alternating updates, resolving the log-likelihood computation barrier. Furthermore, we use the action log-likelihood for unbiased entropy and KL divergence estimation, enabling KL-adaptive learning rates and entropy regularization in on-policy updates. Extensive experiments on eight IsaacLab benchmarks, including legged locomotion (Ant, Humanoid, Anymal-D, Unitree H1, Go2), dexterous manipulation (Shadow Hand), aerial control (Quadcopter), and robotic arm tasks (Franka), demonstrate GenPO's superiority over existing RL baselines. Notably, GenPO is the first method to successfully integrate diffusion policies into on-policy RL, unlocking their potential for large-scale parallelized training and real-world robotic deployment.
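To make the invertibility idea concrete, the sketch below shows how a doubled ("dummy") action paired with alternating additive updates yields an exactly invertible map, so the action log-likelihood follows from the change-of-variables formula. This is a minimal illustration under simplifying assumptions, not GenPO's actual architecture: the networks `f` and `g` are stand-ins for the state-conditioned diffusion denoiser, and the additive form (unit Jacobian determinant) is chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shallow networks standing in for the state-conditioned
# denoiser; names and shapes are illustrative only.
W1 = rng.normal(size=(4, 4)) * 0.1
W2 = rng.normal(size=(4, 4)) * 0.1
f = lambda x: np.tanh(x @ W1)
g = lambda x: np.tanh(x @ W2)

def forward(a, d):
    """Alternating (coupling-style) update: each half is shifted by a
    function of the other half only, so the map is exactly invertible."""
    a2 = a + f(d)
    d2 = d + g(a2)
    return a2, d2

def inverse(a2, d2):
    """Exact inversion by undoing the two updates in reverse order."""
    d = d2 - g(a2)
    a = a2 - f(d)
    return a, d

def log_likelihood(a2, d2):
    """Change of variables: additive alternating updates have unit
    Jacobian determinant, so the log-likelihood of the doubled action
    equals the standard-normal log-density of the recovered latent."""
    a, d = inverse(a2, d2)
    z = np.concatenate([a, d])
    return -0.5 * (z @ z + z.size * np.log(2 * np.pi))

# Round trip: sample a latent pair, push it forward, invert exactly.
a0, d0 = rng.normal(size=4), rng.normal(size=4)
a2, d2 = forward(a0, d0)
ar, dr = inverse(a2, d2)
```

Because the inversion is exact rather than approximated (no Euler-Maruyama discretization of a reverse SDE), the resulting log-likelihood is usable directly inside a PPO-style surrogate objective and for entropy/KL estimates.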