Limited by inference latency, existing robot manipulation policies lack sufficient real-time interaction capability with the environment. Although faster generative methods such as flow matching are gradually replacing diffusion, researchers continue to pursue even faster generation suitable for interactive robot control. MeanFlow, a one-step variant of flow matching, has shown strong potential in image generation, but its precision in action generation does not yet meet the stringent requirements of robotic manipulation. We therefore propose \textbf{HybridFlow}, a \textbf{3-stage method} requiring only \textbf{2 NFEs}: a Global Jump in MeanFlow mode, ReNoise for distribution alignment, and a Local Refine in ReFlow mode. HybridFlow balances inference speed and generation quality by exploiting the speed of MeanFlow's one-step generation while preserving action precision with a minimal number of refinement steps. In real-world experiments, HybridFlow outperforms a 16-step Diffusion Policy by \textbf{15--25\%} in success rate while reducing inference time from 152\,ms to 19\,ms (an \textbf{8$\times$ speedup}, \textbf{$\sim$52\,Hz}); it also achieves 70.0\% success on unseen-color OOD grasping and 66.3\% on deformable-object folding. We envision HybridFlow as a practical low-latency method for enhancing the real-world interaction capability of robotic manipulation policies.
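The 3-stage, 2-NFE pipeline can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard flow-matching convention (data at $t=0$, noise at $t=1$, linear interpolation path), a hypothetical network interface `model(x, r, t)` that returns the (average) velocity, and an assumed re-noising time `t_mid`. Stage 2 costs no network evaluation, so the total is 2 NFEs.

```python
import numpy as np

def hybridflow_sample(model, x1, t_mid=0.3, rng=None):
    """Sketch of a 3-stage, 2-NFE HybridFlow-style sampler.

    model : callable (x, r, t) -> velocity; hypothetical interface where
            (r, t) = (0, 1) queries the MeanFlow average velocity and
            r == t queries the instantaneous (ReFlow) velocity.
    x1    : initial Gaussian noise sample (t = 1).
    t_mid : assumed intermediate time for re-noising.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Stage 1 -- Global Jump (1 NFE): one MeanFlow evaluation maps
    # noise directly to a coarse action sample.
    x0_coarse = x1 - model(x1, 0.0, 1.0)

    # Stage 2 -- ReNoise (0 NFE): blend the coarse sample with fresh
    # noise to place it back on the interpolation path at time t_mid.
    eps = rng.standard_normal(x1.shape)
    x_mid = (1.0 - t_mid) * x0_coarse + t_mid * eps

    # Stage 3 -- Local Refine (1 NFE): one Euler step of the
    # instantaneous velocity field from t_mid down to 0.
    x0 = x_mid - t_mid * model(x_mid, t_mid, t_mid)
    return x0
```

With a trained model, `x1` would be drawn as Gaussian noise of the action-chunk shape and `x0` used directly as the action; `t_mid` controls how much of the trajectory the refinement step revisits.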