Diffusion policies have emerged as a powerful approach to robotic control, offering superior expressiveness in modeling multimodal action distributions compared with conventional policy networks. However, their integration with online reinforcement learning remains challenging due to fundamental incompatibilities between diffusion model training objectives and standard RL policy improvement mechanisms. This paper presents the first comprehensive review and empirical analysis of current Online Diffusion Policy Reinforcement Learning (Online DPRL) algorithms for scalable robotic control systems. We propose a novel taxonomy that categorizes existing approaches into four distinct families based on their policy improvement mechanisms: Action-Gradient, Q-Weighting, Proximity-Based, and Backpropagation Through Time (BPTT) methods. Through extensive experiments on a unified NVIDIA Isaac Lab benchmark encompassing 12 diverse robotic tasks, we systematically evaluate representative algorithms along five critical dimensions: task diversity, parallelization capability, diffusion-step scalability, cross-embodiment generalization, and environmental robustness. Our analysis exposes the fundamental trade-offs inherent in each algorithmic family, particularly between sample efficiency and scalability, and identifies critical computational and algorithmic bottlenecks that currently limit the practical deployment of online DPRL. Based on these findings, we provide concrete guidelines for algorithm selection under specific operational constraints and outline promising research directions to advance the field toward more general and scalable robotic learning systems.