Multi-camera setups with wrist-mounted cameras have become the de facto standard in recent visual imitation learning systems. Manipulation from a single global view remains challenging, however, because the policy must capture fine-grained interaction details and identify task-relevant regions without local wrist views. To address this challenge, we present Spatially Conditioned Diffusion Policy (SCDP), a diffusion-based visuomotor policy that achieves precise and robust manipulation in a single-camera setting. Our key idea is that end-effector trajectories can serve as visual attention anchors that indicate task-relevant regions. Building on this idea, SCDP consists of two key components: (i) a visual encoder that produces multi-scale feature maps to capture both broad context and fine-grained visual features, and (ii) a spatial conditioning module that samples point-wise features along the intermediate end-effector trajectories produced inside the diffusion loop. Extensive simulation experiments show that SCDP consistently outperforms strong single-view baselines and achieves performance comparable to multi-camera baselines. Real-world experiments further demonstrate precise manipulation and robustness to visual distractors, highlighting the potential of single-camera imitation learning.
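To make the spatial conditioning idea concrete, the following is a minimal sketch of point-wise feature sampling from multi-scale feature maps. It is not the SCDP implementation: the function names `bilinear_sample` and `sample_along_trajectory` are hypothetical, bilinear interpolation is an assumed sampling scheme, and the trajectory is assumed to be already projected into normalized image coordinates in [0, 1]. The abstract does not specify any of these details.

```python
import numpy as np

def bilinear_sample(feat, uv):
    """Bilinearly sample a feature map at normalized image points.

    feat: (C, H, W) feature map.
    uv:   (T, 2) points as (x, y) in [0, 1] (assumed convention).
    Returns a (T, C) array with one feature vector per point.
    """
    _, H, W = feat.shape
    # Map normalized coordinates onto the pixel grid of this scale.
    x = np.clip(uv[:, 0], 0.0, 1.0) * (W - 1)
    y = np.clip(uv[:, 1], 0.0, 1.0) * (H - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = x - x0
    wy = y - y0
    # Interpolate along x on the two neighboring rows, then along y.
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx  # (C, T)
    bot = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx  # (C, T)
    return (top * (1 - wy) + bot * wy).T

def sample_along_trajectory(feature_maps, traj_uv):
    """Concatenate per-point features from every scale.

    feature_maps: list of (C_i, H_i, W_i) arrays (coarse to fine, any order).
    traj_uv:      (T, 2) projected end-effector trajectory (hypothetical input).
    Returns a (T, sum(C_i)) array.
    """
    return np.concatenate(
        [bilinear_sample(f, traj_uv) for f in feature_maps], axis=1
    )
```

Under these assumptions, a denoising loop would call `sample_along_trajectory` on the current intermediate trajectory at each diffusion step, so the visual conditioning follows the trajectory as it is refined.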