Generative modeling-based visuomotor policies have been widely adopted in robotic manipulation owing to their ability to model multimodal action distributions. However, the high inference cost of multi-step sampling limits their applicability in real-time robotic systems. Existing approaches accelerate sampling in generative modeling-based visuomotor policies by adapting techniques originally developed to speed up image generation, yet a major distinction remains: image generation typically produces independent samples without temporal dependencies, whereas robotic manipulation requires action trajectories with continuity and temporal coherence. To this end, we propose FreqPolicy, a novel approach that is the first to impose frequency consistency constraints on flow-based visuomotor policies. Our method enables the action model to capture temporal structure effectively while supporting efficient, high-quality one-step action generation. Concretely, we introduce a frequency consistency constraint objective that aligns frequency-domain action features across different timesteps along the flow, thereby driving one-step action generation toward the target distribution. In addition, we design an adaptive consistency loss to capture the structural temporal variations inherent in robotic manipulation tasks. We evaluate FreqPolicy on 53 tasks across three simulation benchmarks, demonstrating its superiority over existing one-step action generators. We further integrate FreqPolicy into a vision-language-action (VLA) model and achieve acceleration without performance degradation on 40 LIBERO tasks. Moreover, we demonstrate its efficiency and effectiveness in real-world robotic scenarios, reaching an inference frequency of 93.5 Hz.
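To make the frequency consistency idea concrete, the following is a minimal NumPy sketch of such an objective, not the authors' implementation: action chunks predicted at two different flow timesteps are compared by the magnitudes of their FFTs along the trajectory (time) axis, and an optional per-frequency weight vector stands in for the adaptive variant. The function name, shapes, and weighting scheme are illustrative assumptions.

```python
import numpy as np

def frequency_consistency_loss(actions_t, actions_s, weights=None):
    """Hypothetical sketch of a frequency-domain consistency penalty.

    actions_t, actions_s: arrays of shape (horizon, action_dim) holding
    action chunks predicted at two different timesteps along the flow.
    Comparing rFFT magnitudes along the time axis pushes the one-step
    prediction toward the multi-step target in the frequency domain.
    """
    # Real FFT over the trajectory (time) axis for each action dimension.
    spec_t = np.fft.rfft(actions_t, axis=0)
    spec_s = np.fft.rfft(actions_s, axis=0)
    diff = np.abs(spec_t) - np.abs(spec_s)
    if weights is not None:
        # Per-frequency weighting: a stand-in for the adaptive loss,
        # which could emphasize task-relevant frequency bands.
        diff = diff * weights[:, None]
    return float(np.mean(diff ** 2))

# Toy usage: identical trajectories incur zero penalty,
# while a rescaled trajectory has a different spectrum.
traj = np.sin(np.linspace(0.0, 2.0 * np.pi, 16))[:, None]
assert frequency_consistency_loss(traj, traj) == 0.0
assert frequency_consistency_loss(traj, 2.0 * traj) > 0.0
```

Using magnitudes (rather than complex spectra) makes the penalty insensitive to small phase shifts while still constraining the temporal structure of the generated trajectory; a full implementation would compute this loss on batched model outputs inside the training loop.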