Foundation models have made significant strides in various applications, including text-to-image generation, panoptic segmentation, and natural language processing. This paper presents Instruct2Act, a framework that uses Large Language Models (LLMs) to map multi-modal instructions to sequential actions for robotic manipulation tasks. Specifically, Instruct2Act employs an LLM to generate Python programs that constitute a comprehensive perception, planning, and action loop for robotic tasks. In the perception stage, predefined APIs give access to multiple foundation models: the Segment Anything Model (SAM) accurately locates candidate objects, and CLIP classifies them. In this way, the framework leverages the expertise of foundation models and robotic abilities to convert complex high-level instructions into precise policy code. Our approach is flexible: it accommodates various instruction modalities and input types and can be adjusted to specific task demands. We validated the practicality and efficiency of our approach by evaluating it on robotic tasks across different scenarios in tabletop manipulation domains. Furthermore, our zero-shot method outperformed many state-of-the-art learning-based policies in several tasks. The code for our proposed approach is available at https://github.com/OpenGVLab/Instruct2Act, serving as a robust benchmark for high-level robotic instruction tasks with assorted modality inputs.
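The perception, planning, and action loop described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the Instruct2Act API: the function names (`segment`, `classify`, `pick_place`, `run_policy`), the toy scene, and the three-dimensional embeddings are hypothetical. The segmentation stub stands in for the role SAM plays, and the cosine-similarity matching stands in for the role CLIP plays; no real model is called and no robot is moved.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    """A segmented region: a bounding box plus a toy feature vector."""
    box: tuple       # (x_min, y_min, x_max, y_max) in pixels
    embedding: list  # stand-in for an image embedding of the cropped region


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def segment(scene):
    """Hypothetical stub for the segmentation step (the role of SAM)."""
    return [Candidate(box=o["box"], embedding=o["embedding"]) for o in scene]


def classify(candidates, text_embeddings):
    """Hypothetical stub for the classification step (the role of CLIP).

    Each candidate receives the label whose text embedding is most similar.
    """
    labelled = {}
    for cand in candidates:
        label = max(
            text_embeddings,
            key=lambda name: cosine(cand.embedding, text_embeddings[name]),
        )
        labelled[label] = cand
    return labelled


def center(box):
    """Center point of a bounding box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def pick_place(pick_xy, place_xy):
    """Hypothetical action primitive: returns the command instead of executing it."""
    return {"action": "pick_place", "pick": pick_xy, "place": place_xy}


def run_policy(scene, text_embeddings, pick_name, place_name):
    """The kind of short program an LLM might emit for 'put A into B'."""
    candidates = segment(scene)                       # perception: locate
    objects = classify(candidates, text_embeddings)   # perception: identify
    pick_xy = center(objects[pick_name].box)          # planning: choose targets
    place_xy = center(objects[place_name].box)
    return pick_place(pick_xy, place_xy)              # action


# Toy inputs (made up for this sketch).
SCENE = [
    {"box": (10, 10, 30, 30), "embedding": [0.9, 0.1, 0.0]},
    {"box": (60, 40, 100, 80), "embedding": [0.1, 0.9, 0.1]},
]
TEXT_EMBEDDINGS = {"red block": [1.0, 0.0, 0.0], "green bowl": [0.0, 1.0, 0.0]}

command = run_policy(SCENE, TEXT_EMBEDDINGS, "red block", "green bowl")
print(command)
```

In this sketch the instruction "put the red block into the green bowl" reduces to a single `pick_place` command between the two box centers. Keeping the perception calls behind small functions is one plausible way to let a generated program swap models or action primitives without changing its overall shape.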