Diffusion models have demonstrated strong potential for robotic trajectory planning. However, generating coherent trajectories from high-level instructions remains challenging, especially for long-horizon compositional tasks that require multiple sequential skills. To address this problem, we propose SkillDiffuser, an end-to-end hierarchical planning framework that integrates interpretable skill learning with conditional diffusion planning. At the higher level, the skill abstraction module learns discrete, human-understandable skill representations from visual observations and language instructions. These learned skill embeddings then condition the diffusion model, which generates customized latent trajectories aligned with the skills. This allows the planner to produce diverse state trajectories that adhere to the learned skills. By integrating skill learning with conditional trajectory generation, SkillDiffuser produces coherent behavior that follows abstract instructions across diverse tasks. Experiments on multi-task robotic manipulation benchmarks such as Meta-World and LOReL show that SkillDiffuser achieves state-of-the-art performance and yields human-interpretable skill representations. More visualization results and information can be found on our website.
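The two-level idea described above (pick a discrete skill, then condition trajectory generation on its embedding) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the SkillDiffuser implementation: the nearest-neighbor codebook lookup, the concatenation-based conditioning, and the names `quantize_skill`, `condition_on_skill`, `codebook`, and `fused_feature` are all hypothetical. The diffusion denoiser itself is omitted.

```python
import math

def quantize_skill(feature, codebook):
    """Return (index, embedding) of the codebook entry nearest to `feature`.

    Assumption: discrete skills are chosen by Euclidean nearest-neighbor
    lookup in a codebook; the source only says the skills are discrete.
    """
    dists = [math.dist(feature, code) for code in codebook]
    idx = min(range(len(codebook)), key=dists.__getitem__)
    return idx, codebook[idx]

def condition_on_skill(noisy_traj, skill_embedding):
    """Append the skill embedding to every state of a noisy trajectory.

    Assumption: conditioning is done by simple concatenation; a real
    denoiser could inject the condition in other ways.
    """
    return [state + skill_embedding for state in noisy_traj]

# Hypothetical codebook of three 2-D skill embeddings.
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

# Stand-in for a fused vision + language feature (hypothetical values).
fused_feature = [0.1, 0.9]

skill_idx, skill_emb = quantize_skill(fused_feature, codebook)

# A noisy trajectory of 4 states with 3 dimensions each (all zeros here).
noisy_traj = [[0.0, 0.0, 0.0] for _ in range(4)]
denoiser_input = condition_on_skill(noisy_traj, skill_emb)

print(skill_idx)               # index of the selected skill
print(len(denoiser_input[0]))  # state dim + skill embedding dim
```

Because the selected skill is an index into a fixed codebook, each index can be inspected and associated with the instructions that activate it, which is one plausible route to the interpretability the abstract mentions.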