Despite great strides in language-guided manipulation, existing work has been constrained to table-top settings. Table-tops allow for perfect and consistent camera angles, properties that do not hold in mobile manipulation. Task plans that involve moving around the environment must be robust to egocentric views and to changes in the plane and angle of grasp. A further challenge is ensuring all of this holds while still learning skills efficiently from limited data. We propose Spatial-Language Attention Policies (SLAP) as a solution. SLAP uses three-dimensional tokens as the input representation to train a single multi-task, language-conditioned action prediction policy. Our method shows an 80% success rate in the real world across eight tasks with a single model, and a 47.5% success rate when unseen clutter and unseen object configurations are introduced, even with only a handful of examples per task. This represents an improvement of 30% over prior work (20% given unseen distractors and configurations). We see a 4x improvement over the baseline in the mobile manipulation setting. In addition, we show how SLAP's robustness allows us to execute task plans from open-vocabulary instructions using a large language model for multi-step mobile manipulation. For videos, see the website: https://robotslap.github.io