Thanks to their remarkable flexibility, diffusion models and flow models have emerged as promising candidates for policy representation. However, efficient reinforcement learning (RL) with these policies remains a challenge, because they do not provide the explicit log-probabilities that vanilla policy gradient estimators require. While numerous methods have been proposed to address this, the field lacks a unified perspective that reconciles these seemingly disparate approaches, which hampers further development. In this paper, we bridge this gap by introducing a comprehensive taxonomy of RL algorithms for diffusion/flow policies. To support reproducibility and agile prototyping, we introduce a modular, JAX-based open-source codebase that leverages JIT compilation for high-throughput training. Finally, we provide systematic and standardized benchmarks across Gym-Locomotion, DeepMind Control Suite, and IsaacLab, offering a rigorous side-by-side comparison of diffusion-based methods and guidance for practitioners in choosing an appropriate algorithm for their application. Our work establishes a clear foundation for understanding and algorithm design, a high-efficiency toolkit for future research in the field, and an algorithmic guideline for practitioners in generative models and robotics. Our code is available at https://github.com/typoverflow/flow-rl.
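To make the log-probability obstacle concrete, the following is a minimal sketch in standard notation. The symbols are illustrative assumptions and are not taken from the paper: $\pi_\theta$ is the policy, $Q^{\pi_\theta}$ its action-value function, $d^{\pi_\theta}$ its state distribution, and $v_\theta$ the velocity field of a flow policy that transports Gaussian noise $x_0$ to an action $a = x_1$. The first line is the usual score-function policy gradient; the last line is the instantaneous change-of-variables formula for a continuous-time flow.

```latex
\begin{align}
\nabla_\theta J(\theta)
  &= \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)}
     \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \right], \\
x_0 &\sim \mathcal{N}(0, I), \qquad
\frac{\mathrm{d}x_t}{\mathrm{d}t} = v_\theta(x_t, t, s), \qquad a = x_1, \\
\log \pi_\theta(a \mid s)
  &= \log \mathcal{N}(x_0;\, 0, I)
     - \int_0^1 \nabla_{x} \cdot v_\theta(x_t, t, s)\, \mathrm{d}t .
\end{align}
```

Under these assumptions, evaluating $\log \pi_\theta(a \mid s)$ means solving the ODE and integrating the divergence of the velocity network along the whole trajectory, rather than reading off a closed-form density as with a Gaussian policy. For a diffusion policy sampled through a stochastic reverse process, the action likelihood additionally marginalizes over the intermediate noisy samples, so it is likewise not available in closed form.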