Diffusion-based visuomotor policies that operate directly in raw action space conflate scene comprehension with trajectory generation in a single denoising process. The resulting velocity field must encode scene information and generate precise trajectories at the same time, which increases learning complexity and limits performance on tasks that demand precise temporal coordination across multiple arms. To simplify this joint learning problem, we introduce Latent Diffusion Policy (LDP), a two-stage framework that performs flow matching in a deliberately shaped latent space. By absorbing scene understanding into an observation-conditioned CVAE encoder, LDP concentrates the latent distribution conditioned on each observation. Consequently, the flow model no longer has to resolve scene-dependent structure implicitly; instead, it generates within a pre-concentrated distribution with a smoother velocity field, which simplifies learning from limited demonstrations. Furthermore, to capture temporal dependencies among latent tokens, LDP is trained with per-token diffusion forcing and uses staircase sampling at inference to resolve the resulting distributional mismatch between training and inference. We also propose reconstruction FID (rFID), a lightweight proxy that predicts downstream task success from latent-space statistics alone. On coordination-intensive tasks from RoboTwin 2.0, LDP outperforms DP3 by a substantial margin and transfers effectively to real-world bimanual deployments.