Diffusion policies (DPs) have recently shown great promise for action generation in robotic manipulation. However, existing approaches often condition short-term control signals on global instructions, which can lead to misaligned action generation. We conjecture that primitive skills, i.e., fine-grained, short-horizon manipulations such as ``move up'' and ``open the gripper'', provide a more intuitive and effective interface for robot learning. To bridge this gap, we propose SDP, a skill-conditioned diffusion policy that integrates interpretable skill learning with conditional action planning. SDP abstracts eight reusable primitive skills shared across tasks and employs a vision-language model to extract discrete skill representations from visual observations and language instructions. Based on these representations, a lightweight router network assigns the desired primitive skill to each state, which in turn selects a single-skill policy to generate skill-aligned actions. By decomposing complex tasks into sequences of primitive skills and invoking one single-skill policy at a time, SDP ensures skill-consistent behavior across diverse tasks. Extensive experiments on two challenging simulation benchmarks and on real-world robot deployments demonstrate that SDP consistently outperforms state-of-the-art methods, offering a new paradigm for skill-based robot learning with diffusion policies.