Diffusion policies (DP) have demonstrated significant potential in visual navigation by capturing diverse, multi-modal trajectory distributions. However, standard imitation learning (IL), on which most DP methods rely for training, often inherits sub-optimality and redundancy from expert demonstrations, necessitating a computationally intensive "generate-then-filter" pipeline that depends on auxiliary selectors during inference. To address these challenges, we propose Self-Imitated Diffusion Policy (SIDP), a novel framework that learns improved planning by selectively imitating a set of trajectories sampled from the policy itself. Specifically, SIDP introduces a reward-guided self-imitation mechanism that encourages the policy to consistently and efficiently produce high-quality trajectories, rather than outputs of inconsistent quality, thereby reducing reliance on extensive sampling and post-filtering. During training, we employ a reward-driven curriculum learning paradigm to mitigate inefficient data utilization, and goal-agnostic exploration for trajectory augmentation to improve planning robustness. Extensive evaluations on a comprehensive simulation benchmark show that SIDP significantly outperforms previous methods, and real-world experiments confirm its effectiveness across multiple robotic platforms. On a Jetson Orin Nano, SIDP delivers 2.5$\times$ faster inference than the baseline NavDP (110 ms vs. 273 ms), enabling efficient real-time deployment.
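The core idea of reward-guided self-imitation described above — sample candidate trajectories from the current policy, score them with a reward, and imitate only the highest-reward samples — can be illustrated with a minimal sketch. This is a toy stand-in, not the paper's method: the actual SIDP policy is a diffusion model over navigation trajectories, whereas the Gaussian policy, 1-D state, `GOAL`, and all function names here are illustrative assumptions.

```python
import random

# Toy sketch of the sample -> score -> selectively-imitate loop
# (assumption: a 1-D Gaussian policy stands in for the diffusion policy).

random.seed(0)

GOAL = 1.0  # hypothetical navigation goal position (illustrative only)

def sample_trajectories(mean, n=16, horizon=8, noise=0.5):
    """Sample n candidate trajectories from the current toy policy."""
    return [[m + random.gauss(0.0, noise) for m in mean] for _ in range(n)]

def reward(traj):
    """Toy reward: higher when the trajectory ends near the goal."""
    return -abs(traj[-1] - GOAL)

mean = [0.0] * 8  # stand-in for the policy's parameters
for _ in range(50):
    trajs = sample_trajectories(mean)
    trajs.sort(key=reward, reverse=True)
    best = trajs[:4]  # selectively keep only the highest-reward samples
    # Imitation target: average of the selected high-reward trajectories.
    target = [sum(vals) / len(best) for vals in zip(*best)]
    # Self-imitation update: move the policy toward its own best samples.
    mean = [m + 0.5 * (t - m) for m, t in zip(mean, target)]

print(round(mean[-1], 2))  # the final state drifts toward GOAL
```

Because the policy imitates only its own top-reward samples, it converges toward consistently producing high-quality trajectories without any external selector at inference time — the property the abstract credits for removing the generate-then-filter step.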