Diffusion models have demonstrated strong potential for robotic trajectory planning. However, generating coherent, long-horizon trajectories from high-level instructions remains challenging, especially for complex tasks that require multiple sequential skills. To address this problem, we propose SkillDiffuser, an end-to-end hierarchical planning framework that integrates interpretable skill learning with conditional diffusion planning. At the higher level, the skill abstraction module learns discrete, human-understandable skill representations from visual observations and language instructions. These learned skill embeddings are then used to condition the diffusion model, which generates customized latent trajectories aligned with the skills. This conditioning allows the model to generate diverse state trajectories that adhere to the learned skills. By integrating skill learning with conditional trajectory generation, SkillDiffuser produces coherent behavior that follows abstract instructions across diverse tasks. Experiments on multi-task robotic manipulation benchmarks such as Meta-World and LOReL demonstrate state-of-the-art performance and human-interpretable skill representations.
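The two-level design described in the abstract can be illustrated with a toy sketch. It is not the SkillDiffuser implementation: the abstract does not say how discrete skills are selected or how the diffusion model is parameterized. The sketch assumes that a skill is chosen by nearest-neighbour lookup in a small codebook, and it replaces the learned denoiser with a hand-written stand-in that pulls every state toward the skill vector. The loop is a simplified deterministic refinement, not a real DDPM sampler. All names (`select_skill`, `toy_denoiser`, `sample_trajectory`) and all numeric values are hypothetical.

```python
import random


def select_skill(fused, codebook):
    """Return the index of the codebook vector closest to `fused` (squared L2).

    Assumption: `fused` stands in for a joint vision-language embedding and
    `codebook` for a set of discrete skill embeddings.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda k: sq_dist(fused, codebook[k]))


def toy_denoiser(x, skill_vec, t):
    """Hypothetical stand-in for a learned noise predictor eps(x, t, skill).

    It treats the skill vector as the target state, so the predicted "noise"
    is each state's offset from that target. The step index `t` is unused
    here; a learned model would normally depend on it.
    """
    return [[s - g for s, g in zip(state, skill_vec)] for state in x]


def sample_trajectory(skill_vec, horizon=8, steps=50, step_size=0.1, seed=0):
    """Start from Gaussian noise and iteratively refine, conditioned on a skill."""
    rng = random.Random(seed)
    dim = len(skill_vec)
    # Initial trajectory: `horizon` states of pure noise.
    x = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(horizon)]
    for t in reversed(range(steps)):
        eps = toy_denoiser(x, skill_vec, t)
        # Remove a fraction of the predicted noise at every step.
        x = [
            [s - step_size * e for s, e in zip(state, eps_state)]
            for state, eps_state in zip(x, eps)
        ]
    return x


codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]  # three hypothetical skills
skill_id = select_skill([0.9, 1.2], codebook)      # higher level: pick a skill
trajectory = sample_trajectory(codebook[skill_id]) # lower level: conditioned sampling
```

In this sketch, changing the selected skill changes only the conditioning vector passed to the sampler, which mirrors the separation between skill abstraction and trajectory generation that the abstract describes.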