We investigate how to build and train spatial representations for robot decision making with Transformers. In particular, for robots to operate in a range of environments, we must be able to quickly train or fine-tune robot sensorimotor policies that are robust to clutter, data efficient, and able to generalize well to different circumstances. As a solution, we propose Spatial Language Attention Policies (SLAP). SLAP uses three-dimensional tokens as the input representation to train a single multi-task, language-conditioned action prediction policy. Our method achieves an 80% success rate in the real world across eight tasks with a single model, and a 47.5% success rate when unseen clutter and unseen object configurations are introduced, even with only a handful of examples per task. This represents an improvement of 30% over prior work, and of 20% when unseen distractors and configurations are introduced.