Robots operating in the real world require both rich manipulation skills and the ability to reason semantically about when to apply those skills. Toward this goal, recent works have integrated semantic representations from large-scale pretrained vision-language (VL) models into manipulation models, endowing them with more general reasoning capabilities. However, we show that the conventional pretraining-finetuning pipeline for integrating such representations entangles the learning of domain-specific action information and domain-general visual information, leading to less data-efficient training and poor generalization to unseen objects and tasks. To address this, we propose ProgramPort, a modular approach that better leverages pretrained VL models by exploiting the syntactic and semantic structures of language instructions. Our framework uses a semantic parser to recover an executable program, composed of functional modules grounded on vision and action across different modalities. Each functional module is realized as a combination of deterministic computation and learnable neural networks. Program execution produces parameters for general manipulation primitives for a robotic end-effector. The entire modular network can be trained with end-to-end imitation learning objectives. Experiments show that our model successfully disentangles action and perception, translating to improved zero-shot and compositional generalization across a variety of manipulation behaviors. Project webpage: https://progport.github.io
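To make the parse-then-execute idea concrete, the following is a minimal illustrative sketch, not the ProgramPort implementation. All names here (`Scene`, `Filter`, `PickPlace`, `parse`) are hypothetical, the parser is a toy that handles a single instruction template, and the perception module is a plain dictionary lookup standing in for grounding with a pretrained VL model. It only shows how an instruction could be turned into a program of perception and action modules whose execution yields parameters for a pick-and-place primitive.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical scene: object description -> (x, y) position.
# Assumption: a real system would ground phrases on images via a VL model.
Scene = Dict[str, Tuple[float, float]]


@dataclass(frozen=True)
class Filter:
    """Perception module: returns the location of the object named by a phrase."""
    phrase: str

    def run(self, scene: Scene) -> Tuple[float, float]:
        return scene[self.phrase]


@dataclass(frozen=True)
class PickPlace:
    """Action module: turns two grounded locations into primitive parameters."""
    source: Filter
    target: Filter

    def run(self, scene: Scene) -> Dict[str, Tuple[float, float]]:
        return {"pick": self.source.run(scene), "place": self.target.run(scene)}


def parse(instruction: str) -> PickPlace:
    """Toy stand-in for a semantic parser; accepts only 'put the X in the Y'."""
    words = instruction.lower().rstrip(".").split()
    if words[:2] != ["put", "the"] or "in" not in words:
        raise ValueError(f"unsupported instruction: {instruction!r}")
    split = words.index("in")
    if split + 2 >= len(words) or words[split + 1] != "the" or split == 2:
        raise ValueError(f"unsupported instruction: {instruction!r}")
    source = " ".join(words[2:split])
    target = " ".join(words[split + 2:])
    return PickPlace(Filter(source), Filter(target))


if __name__ == "__main__":
    scene = {"red block": (0.1, 0.2), "brown box": (0.5, 0.6)}
    program = parse("Put the red block in the brown box.")
    print(program.run(scene))
```

In this sketch the perception step (`Filter`) knows nothing about the action, and the action step (`PickPlace`) only consumes grounded locations, which mirrors in miniature the separation of perception and action that the abstract describes.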