Dichotomous Diffusion Policy Optimization

Diffusion-based policies have gained growing popularity in solving a wide range of decision-making tasks due to their superior expressiveness and controllable generation during inference. However, effectively training large diffusion policies using reinforcement learning (RL) remains challenging. Existing methods either train unstably because they directly maximize value objectives, or incur high computational cost because they rely on a crude Gaussian likelihood approximation, which requires a large number of sufficiently small denoising steps. In this work, we propose DIPOLE (Dichotomous diffusion Policy improvement), a novel RL algorithm designed for stable and controllable diffusion policy optimization. We begin by revisiting the KL-regularized objective in RL, which offers a desirable weighted regression objective for diffusion policy extraction, but often struggles to balance greediness and stability. We then formulate a greedified policy regularization scheme, which naturally enables decomposing the optimal policy into a pair of stably learned dichotomous policies: one aims at reward maximization, and the other focuses on reward minimization. Under such a design, optimized actions can be generated by linearly combining the scores of the dichotomous policies during inference, thereby enabling flexible control over the level of greediness. Evaluations in offline and offline-to-online RL settings on ExORL and OGBench demonstrate the effectiveness of our approach. We also use DIPOLE to train a large vision-language-action (VLA) model for end-to-end autonomous driving (AD) and evaluate it on the large-scale real-world AD benchmark NAVSIM, highlighting its potential for complex real-world applications.
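The inference-time idea of linearly combining the scores of two policies can be illustrated with a toy sketch. The abstract does not state the exact combination rule, so the form `(1 + w) * s_pos - w * s_neg` used here is an assumption borrowed from guidance-style extrapolation, not the paper's confirmed formula. The function names, the 1-D Gaussian "policies", and the Langevin sampler are all hypothetical stand-ins for learned diffusion models.

```python
import numpy as np


def gaussian_score(x, mean, std):
    """Score (gradient of the log-density) of N(mean, std^2) evaluated at x."""
    return -(x - mean) / std**2


def combined_score(x, score_pos, score_neg, w):
    """Assumed linear combination of two scores (hypothetical rule).

    w = 0 recovers the reward-maximizing policy alone; larger w pushes
    samples further away from the reward-minimizing policy.
    """
    return (1.0 + w) * score_pos(x) - w * score_neg(x)


def langevin_sample(score_fn, n_samples=5000, n_steps=500, step=0.01, seed=0):
    """Unadjusted Langevin dynamics driven by score_fn (toy sampler)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)
    for _ in range(n_steps):
        noise = rng.standard_normal(n_samples)
        x = x + step * score_fn(x) + np.sqrt(2.0 * step) * noise
    return x


def sample_actions(w, mu_pos=1.0, mu_neg=-1.0, std=0.5):
    """Draw toy 'actions' from the combined score at greediness level w."""
    def score_pos(x):
        return gaussian_score(x, mu_pos, std)

    def score_neg(x):
        return gaussian_score(x, mu_neg, std)

    return langevin_sample(lambda x: combined_score(x, score_pos, score_neg, w))
```

With equal-variance Gaussians the combined score is itself the score of a Gaussian centered at `mu_pos + w * (mu_pos - mu_neg)`, so in this toy setting the single scalar `w` shifts the sampled actions further from the reward-minimizing policy. That is one way to read "flexible control over the level of greediness".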
