Drift-Based Policy Optimization: Native One-Step Policy Learning for Online Robot Control

Although multi-step generative policies achieve strong performance in robotic manipulation by modeling multimodal action distributions, they require iterative denoising at inference time. Each action therefore needs tens to hundreds of network function evaluations (NFEs), which makes these policies costly for high-frequency closed-loop control and online reinforcement learning (RL). To address this limitation, we propose a two-stage framework for native one-step generative policies that shifts refinement from inference to training. First, we introduce the Drift-Based Policy (DBP), which leverages fixed-point drifting objectives to internalize iterative refinement into the model parameters, yielding a one-step generative backbone by design while preserving multimodal action modeling capacity. Second, we develop Drift-Based Policy Optimization (DBPO), an online RL framework that equips the pretrained backbone with a compatible stochastic interface, enabling stable on-policy updates without sacrificing the one-step deployment property. Extensive experiments demonstrate the effectiveness of the proposed framework across offline imitation learning, online fine-tuning, and real-world control scenarios. DBP matches or exceeds the performance of multi-step diffusion policies while achieving up to $100\times$ faster inference. It also consistently outperforms existing one-step baselines on challenging manipulation benchmarks. Moreover, DBPO enables effective and stable policy improvement in online settings. Experiments on a real-world dual-arm robot demonstrate reliable high-frequency control at 105.2 Hz.
