Trace-Focused Diffusion Policy for Multi-Modal Action Disambiguation in Long-Horizon Robotic Manipulation

Generative model-based policies have shown strong performance in imitation-based robotic manipulation by learning action distributions from demonstrations. However, in long-horizon tasks, visually similar observations often recur across execution stages while requiring distinct actions. Policies conditioned only on instantaneous observations then produce ambiguous predictions, a problem we term multi-modal action ambiguity (MA2). To address this challenge, we propose the Trace-Focused Diffusion Policy (TF-DP), a simple yet effective diffusion-based framework that explicitly conditions action generation on the robot's execution history. TF-DP represents historical motion as an explicit execution trace and projects it into the visual observation space, providing stage-aware context when the current observation alone is insufficient. In addition, the induced trace-focused field emphasizes task-relevant regions associated with historical motion, improving robustness to background visual disturbances. We evaluate TF-DP on real-world robotic manipulation tasks that exhibit pronounced multi-modal action ambiguity and visually cluttered conditions. Experimental results show that TF-DP improves temporal consistency and robustness, outperforming the vanilla diffusion policy by 80.56 percent on tasks with multi-modal action ambiguity and by 86.11 percent under visual disturbances, while increasing inference runtime by only 6.4 percent. These results demonstrate that execution-trace conditioning offers a scalable and principled approach to robust long-horizon robotic manipulation within a single policy.
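As a rough illustration of projecting an execution trace into the visual observation space, the sketch below renders past end-effector positions as a soft map aligned with the camera image. The abstract does not specify how TF-DP builds its trace or trace-focused field, so everything here is an assumption: a pinhole camera with known intrinsics `K`, waypoints already expressed in the camera frame, Gaussian blobs with linear recency weighting, and the hypothetical function names `project_points`, `trace_field`, and `add_trace_channel`.

```python
import numpy as np

def project_points(points_cam, K):
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) pixel coordinates."""
    uvw = points_cam @ K.T              # homogeneous pixel coordinates, shape (N, 3)
    return uvw[:, :2] / uvw[:, 2:3]     # divide by depth

def trace_field(points_cam, K, height, width, sigma=4.0):
    """Render a soft trace map in [0, 1] (assumed design, not the paper's).

    Waypoints are ordered oldest to newest; newer waypoints get larger weight,
    so the map carries a coarse cue about how far execution has progressed.
    """
    uv = project_points(points_cam, K)
    ys, xs = np.mgrid[0:height, 0:width]
    field = np.zeros((height, width), dtype=np.float64)
    n = len(uv)
    for i, (u, v) in enumerate(uv):
        weight = (i + 1) / n            # oldest is faintest, newest has weight 1
        blob = weight * np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2 * sigma ** 2))
        field = np.maximum(field, blob)
    return field.astype(np.float32)

def add_trace_channel(image, field):
    """Stack the trace map as an extra channel: (H, W, C) -> (H, W, C + 1)."""
    return np.concatenate([image, field[..., None].astype(image.dtype)], axis=-1)
```

Under these assumptions, two frames that look the same but are reached at different task stages would carry different trace channels, which is the kind of stage-aware context the abstract describes. The same map could also be used to reweight image features toward the trace, although whether TF-DP does so is not stated here.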
