We present Generalizable Hierarchical Skill Learning (GSL), a novel framework for hierarchical policy learning that significantly improves policy generalization and sample efficiency in robot manipulation. The core idea of GSL is to use object-centric skills as an interface bridging the high-level vision-language model and the low-level visuomotor policy. Specifically, GSL decomposes demonstrations into transferable, object-canonicalized skill primitives using foundation models, enabling efficient low-level skill learning in the object frame. At test time, the skill-object pairs predicted by the high-level agent are fed to the low-level module, and the inferred canonical actions are mapped back to the world frame for execution. This structured yet flexible design yields substantial improvements in sample efficiency and generalization across unseen spatial arrangements, object appearances, and task compositions. In simulation, GSL trained with only 3 demonstrations per task outperforms baselines trained with 30 times more data by 15.5% on unseen tasks. In real-world experiments, GSL also surpasses the baseline trained with 10 times more data.
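The frame mapping described above, where actions inferred in an object's canonical frame are transformed back to the world frame for execution, can be sketched with a rigid-body homogeneous transform. This is a minimal illustration, not GSL's actual implementation: the helper name and the assumption that actions are 3-D end-effector positions are hypothetical, since the abstract does not specify the action space.

```python
import numpy as np

def to_world_frame(T_world_object: np.ndarray, action_obj: np.ndarray) -> np.ndarray:
    """Map a 3-D position expressed in the object's canonical frame
    into the world frame via a 4x4 homogeneous transform.
    (Hypothetical helper; GSL's true action space is unspecified.)"""
    p_homog = np.append(action_obj, 1.0)      # lift to homogeneous coordinates
    return (T_world_object @ p_homog)[:3]     # rotate + translate, drop the 1

# Example: object located at (1, 2, 0) in the world, identity orientation.
T = np.eye(4)
T[:3, 3] = [1.0, 2.0, 0.0]

# A grasp point 5 cm above the object origin, expressed in the object frame,
# lands 5 cm above the object's world position.
world_action = to_world_frame(T, np.array([0.0, 0.0, 0.05]))
```

Because the low-level policy only ever sees actions in the canonical object frame, the same learned skill transfers to any new object pose by swapping in a different `T_world_object`.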