STEP: Warm-Started Visuomotor Policies with Spatiotemporal Consistency Prediction

Diffusion policies have recently emerged as a powerful paradigm for visuomotor control in robotic manipulation because they can model the distribution of action sequences and capture its multimodality. However, iterative denoising incurs substantial inference latency, which limits the control frequency of real-time closed-loop systems. Existing acceleration methods reduce the number of sampling steps, bypass diffusion through direct prediction, or reuse past actions, but they often struggle to preserve action quality while achieving consistently low latency. In this work, we propose STEP, a lightweight spatiotemporal consistency prediction mechanism that constructs high-quality warm-start actions that are both distributionally close to the target action and temporally consistent, without compromising the generative capability of the original diffusion policy. We further propose a velocity-aware perturbation injection mechanism that adaptively modulates actuation excitation based on temporal action variation to prevent execution stalls, especially in real-world tasks. We also provide a theoretical analysis showing that the proposed prediction induces a locally contractive mapping, which ensures convergence of action errors during diffusion refinement. We conduct extensive evaluations on nine simulated benchmarks and two real-world tasks. Notably, STEP with only 2 sampling steps achieves an average success rate 21.6% and 27.5% higher than BRIDGER and DDIM on the RoboMimic benchmark and real-world tasks, respectively. These results demonstrate that STEP consistently advances the Pareto frontier of inference latency and success rate over existing methods.
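
To give some intuition for why a warm start can help when refinement is locally contractive, the toy sketch below applies a fixed number of contractive updates to a cold-start and a warm-start action. It is only an illustration under stated assumptions: the update rule, the contraction factor `kappa`, the helper names `refine` and `l2_error`, and all numeric values are hypothetical and are not taken from STEP, whose denoiser and predictor are learned networks.

```python
def refine(action, target, kappa=0.5, steps=2):
    """Apply `steps` rounds of a toy contractive update toward `target`.

    Each round shrinks the error by the factor `kappa` (0 <= kappa < 1).
    This is a stand-in for diffusion refinement, not the STEP denoiser.
    """
    for _ in range(steps):
        action = [t + kappa * (a - t) for a, t in zip(action, target)]
    return action


def l2_error(action, target):
    """Euclidean distance between an action vector and the target."""
    return sum((a - t) ** 2 for a, t in zip(action, target)) ** 0.5


# Hypothetical 3-D target action and two initializations.
target = [0.3, -0.1, 0.7]
cold_start = [1.5, -1.2, -0.8]           # far from the target, like a noise sample
warm_start = [t + 0.05 for t in target]  # assumed predictor output near the target

cold_err = l2_error(refine(cold_start, target), target)
warm_err = l2_error(refine(warm_start, target), target)

# With the same 2-step budget, both errors shrink by kappa**2 = 0.25,
# so the initialization that starts closer also ends closer.
print(f"cold-start error after 2 steps: {cold_err:.4f}")
print(f"warm-start error after 2 steps: {warm_err:.4f}")
```

In this toy setting the error after K steps is `kappa**K` times the initial error, so under a small step budget the quality of the initialization dominates the final error.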
