We present GHOST, a framework for learning visuomotor manipulation policies that generalize beyond the training distribution. GHOST factorizes control into (i) a high-level policy that predicts the next sub-goal as a distribution over 3D end-effector poses from multi-view RGB-D observations, and (ii) a low-level goal-conditioned controller that executes embodiment-specific actions. To condition image-based policies on 3D goals, we introduce a simple spatial interface that projects predicted goals into the image plane and represents them as end-effector heatmaps. Across a suite of manipulation tasks, this hierarchical factorization consistently improves performance and robustness compared to a flat Diffusion Policy. Further, we show that this hierarchical interface also makes it easy to incorporate human demonstrations without relying on (noisy) action retargeting. As sub-goals are largely embodiment-agnostic, we train the high-level policy on human video to specify how learned skills should be applied and composed, while keeping the low-level policy trained purely on robot data. This hierarchy enables adaptation to novel objects and task variations using a small number of human demonstrations.
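As an illustration of the spatial interface described above, the following is a minimal sketch of projecting a 3D goal position into one camera view and rendering it as a heatmap. It is not the authors' implementation: the pinhole camera model, the Gaussian rendering, the function names `project_point` and `gaussian_heatmap`, and all numeric values are assumptions chosen for illustration. It also handles only a single goal position in a single view, whereas the abstract describes a distribution over end-effector poses and multi-view observations.

```python
import math


def project_point(point_cam, fx, fy, cx, cy):
    """Project a 3D point given in the camera frame to pixel coordinates (u, v).

    Assumes an ideal pinhole camera with focal lengths (fx, fy) and
    principal point (cx, cy), and no lens distortion.
    """
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return fx * x / z + cx, fy * y / z + cy


def gaussian_heatmap(u, v, width, height, sigma=2.0):
    """Render a height x width heatmap with a Gaussian bump centred at (u, v).

    The Gaussian form and the sigma value are illustrative assumptions.
    """
    return [
        [
            math.exp(-((col - u) ** 2 + (row - v) ** 2) / (2.0 * sigma ** 2))
            for col in range(width)
        ]
        for row in range(height)
    ]


# Hypothetical goal: 1 m in front of the camera, 10 cm to the right.
goal_cam = (0.1, 0.0, 1.0)
u, v = project_point(goal_cam, fx=100.0, fy=100.0, cx=32.0, cy=24.0)

# The heatmap can be stacked with the RGB image as an extra input channel
# for an image-based low-level policy.
heatmap = gaussian_heatmap(u, v, width=64, height=48)
```

With multiple cameras, the same goal would be projected into each view using that camera's extrinsics and intrinsics, giving one heatmap per image. A goal that projects outside the image bounds yields a heatmap with no clear peak, which a real system would need to handle explicitly.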