Despite great strides in language-guided manipulation, existing work has been constrained to table-top settings. Table-tops allow for perfect and consistent camera angles, properties that do not hold in mobile manipulation. Task plans that involve moving around the environment must be robust to egocentric views and to changes in the plane and angle of grasp. A further challenge is achieving this robustness while still learning skills efficiently from limited data. We propose Spatial-Language Attention Policies (SLAP) as a solution. SLAP uses three-dimensional tokens as the input representation to train a single multi-task, language-conditioned action prediction policy. Our method shows an 80% success rate in the real world across eight tasks with a single model, and a 47.5% success rate when unseen clutter and unseen object configurations are introduced, even with only a handful of examples per task. This represents an improvement of 30% over prior work (20% given unseen distractors and configurations). We see a 4x improvement over the baseline in the mobile manipulation setting. In addition, we show how SLAP's robustness allows us to execute task plans from open-vocabulary instructions using a large language model for multi-step mobile manipulation. For videos, see the website: https://robotslap.github.io