Learning long-horizon robotic manipulation requires jointly achieving expressive behavior modeling, real-time inference, and stable execution, which remains challenging for existing generative policies. Diffusion-based approaches provide strong modeling capacity but typically incur high inference latency, while flow matching enables fast one-step generation but often leads to unstable execution when applied directly in the raw action space. We propose LG-Flow Policy, a trajectory-level imitation learning framework that performs flow matching in a continuous latent action space. By encoding action sequences into temporally regularized latent trajectories and learning an explicit latent-space flow, the proposed approach decouples global motion structure from low-level control noise, yielding smooth and reliable long-horizon execution. LG-Flow Policy further incorporates geometry-aware point cloud conditioning and execution-time multimodal modulation, with visual cues evaluated as a representative modality in real-world settings. Experimental results in simulation and on physical robot platforms demonstrate that LG-Flow Policy achieves near single-step inference, substantially improves trajectory smoothness and task success over flow-based baselines operating in the raw action space, and remains significantly more efficient than diffusion-based policies.
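The core recipe the abstract describes — conditional flow matching over encoded action latents, with one-step Euler generation at inference time — can be sketched in a few lines. This is a minimal illustrative toy, not the paper's implementation: the latent dimension `D`, the linear velocity field, and the synthetic "encoded demonstration" latents are all assumptions made here for demonstration.

```python
import numpy as np

# Toy sketch of flow matching in a latent action space (assumed setup,
# NOT the LG-Flow Policy implementation).
rng = np.random.default_rng(0)
D = 4                        # assumed latent dimension
W = np.zeros((D, D + 1))     # weights of a toy linear velocity field v(z, t)

def velocity(z, t):
    """Predict the flow velocity for latent z at time t."""
    return W @ np.concatenate([z, [t]])

def fm_step(z1, lr=0.05):
    """One conditional flow-matching SGD step toward target latent z1.

    Samples noise z0 and time t, interpolates z_t = (1-t) z0 + t z1,
    and regresses the velocity field onto the constant target z1 - z0.
    """
    global W
    z0 = rng.standard_normal(D)
    t = rng.uniform()
    zt = (1 - t) * z0 + t * z1
    x = np.concatenate([zt, [t]])
    err = W @ x - (z1 - z0)
    W -= lr * np.outer(err, x)          # gradient of 0.5 * ||err||^2
    return 0.5 * err @ err

# Stand-in for encoded demonstration trajectories: latents near a fixed point.
z_star = np.array([1.0, -0.5, 0.3, 0.8])
losses = [fm_step(z_star + 0.05 * rng.standard_normal(D)) for _ in range(2000)]

# Near single-step inference: one Euler step from noise through the flow.
z0 = rng.standard_normal(D)
z_gen = z0 + velocity(z0, 0.0)
```

The one-step generation at the end is what gives flow matching its latency advantage over iterative diffusion sampling; in the paper's framework the resulting latent trajectory would then be decoded back into an executable action sequence.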