Reinforcement learning with massively parallel simulation has become a standard framework for developing robust, deployable policies, yet most existing approaches still rely on simple Gaussian policy parameterizations. Diffusion models provide a more expressive policy class and have shown strong performance on challenging control problems, but most diffusion-based RL methods are designed for offline or off-policy training. In this work, we ask whether diffusion policies can be trained effectively in the massively parallel, on-policy regime. To this end, we introduce Trust-region Diffusion Policies (TruDi), a method that enables on-policy RL with diffusion policies in massively parallel simulation. This setting is particularly challenging because the data distribution shifts quickly across updates, which makes stable training of complex policies difficult. TruDi addresses this by integrating a trust-region optimization rule that enforces a KL-divergence constraint over the entire diffusion trajectory. We evaluate TruDi on four massively parallel RL benchmarks comprising 73 tasks in total. TruDi consistently matches or outperforms strong baselines on standard tasks and achieves clear gains on more challenging humanoid control tasks, establishing a strong new baseline for massively parallel on-policy RL.
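The abstract does not spell out how the KL-divergence constraint over the diffusion trajectory is computed or enforced, so the following is only an illustrative sketch and not TruDi's actual procedure. It assumes that both the old and the updated policy are denoising chains with Gaussian steps, that the two chains share the same initial noise distribution and the same fixed per-step standard deviations, and that the per-step means are evaluated along trajectories sampled from the old policy. Under those assumptions, the chain rule for KL divergence reduces the trajectory-level KL to a sum of per-step Gaussian KLs. The names `trajectory_kl`, `within_trust_region`, and `epsilon` are hypothetical.

```python
import numpy as np


def trajectory_kl(mu_old, mu_new, sigma):
    """Per-sample KL between two Gaussian denoising chains (illustrative only).

    Assumptions (not taken from the paper): every denoising step is an
    isotropic Gaussian, both chains share the per-step std `sigma`, and both
    start from the same noise distribution.

    mu_old, mu_new: arrays of shape (batch, steps, action_dim) holding the
        per-step means of the old and updated chains at the same inputs.
    sigma: array of shape (steps,) with the shared per-step std.
    Returns an array of shape (batch,).
    """
    mu_old = np.asarray(mu_old, dtype=float)
    mu_new = np.asarray(mu_new, dtype=float)
    sigma = np.asarray(sigma, dtype=float)

    # Squared distance between the step means, shape (batch, steps).
    sq_dist = np.sum((mu_old - mu_new) ** 2, axis=-1)

    # KL(N(a, s^2 I) || N(b, s^2 I)) = ||a - b||^2 / (2 s^2) for each step.
    per_step_kl = sq_dist / (2.0 * sigma ** 2)

    # Chain rule: the trajectory KL is the sum of the per-step KLs.
    return per_step_kl.sum(axis=-1)


def within_trust_region(mu_old, mu_new, sigma, epsilon):
    """Check whether the batch-averaged trajectory KL is at most `epsilon`."""
    return bool(trajectory_kl(mu_old, mu_new, sigma).mean() <= epsilon)
```

A check of this kind could, for example, gate or scale a policy update: an update whose averaged trajectory KL exceeds the hypothetical bound `epsilon` would be rejected or shrunk. How TruDi actually enforces its constraint is described in the paper body, not in the abstract.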