Programming robot behaviour in a complex world poses challenges at multiple levels, from dextrous low-level skills to high-level planning and reasoning. Recent pre-trained Large Language Models (LLMs) have shown remarkable reasoning ability in zero-shot robotic planning. However, it remains challenging to ground LLMs in multimodal sensory input and continuous action output while enabling a robot to interact with its environment and acquire novel information as its policies unfold. We develop a robot interaction scenario with a partially observable state, which requires the robot to choose among a range of epistemic actions in order to sample sensory information across multiple modalities before it can execute the task correctly. We therefore propose an interactive perception framework with an LLM as its backbone, whose ability is exploited to instruct epistemic actions, to reason over the resulting multimodal sensations (vision, sound, haptics, proprioception), and to plan an entire task execution based on the interactively acquired information. Our study demonstrates that LLMs can provide high-level planning and reasoning skills and control interactive robot behaviour in a multimodal environment, while multimodal modules with the context of the environmental state help ground the LLMs and extend their processing ability.
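To make the interaction loop described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy scene whose object properties are hidden, three hypothetical epistemic actions (`look`, `knock`, `weigh`) whose "multimodal modules" return short text descriptions, and a rule-based function `stub_llm` that stands in for a real LLM call. All names, the keyword-matching rule, and the action order are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class HiddenObject:
    """Toy object whose properties the planner cannot see directly."""
    name: str
    colour: str    # revealed by looking (vision)
    material: str  # revealed by knocking (sound)
    weight: str    # revealed by weighing (proprioception)


# Hypothetical multimodal modules: each turns a sensation into text that a
# language model could read. Real modules would be learned classifiers.
EPISTEMIC_ACTIONS: Dict[str, Callable[[HiddenObject], str]] = {
    "look": lambda o: f"{o.name} looks {o.colour}",
    "knock": lambda o: f"{o.name} sounds like {o.material}",
    "weigh": lambda o: f"{o.name} feels {o.weight}",
}
ACTION_ORDER = ["look", "knock", "weigh"]

History = List[Tuple[str, str, str]]  # (action, object name, observation)


def stub_llm(keywords: List[str], names: List[str], history: History) -> Tuple[str, str]:
    """Rule-based placeholder for an LLM (an assumption, not a real API).

    Returns a final action once some object's observations mention every
    keyword; otherwise returns the next untried epistemic action.
    """
    for name in names:
        text = " ".join(obs for _, n, obs in history if n == name)
        if all(k in text for k in keywords):
            return ("pick_up", name)
    for name in names:
        tried = {a for a, n, _ in history if n == name}
        for action in ACTION_ORDER:
            if action not in tried:
                return (action, name)
    return ("give_up", "")


def interactive_perception(keywords: List[str], scene: List[HiddenObject],
                           max_steps: int = 20) -> Tuple[str, str, History]:
    """Alternate between choosing an action and observing its result."""
    by_name = {o.name: o for o in scene}
    names = list(by_name)
    history: History = []
    for _ in range(max_steps):
        action, name = stub_llm(keywords, names, history)
        if action not in EPISTEMIC_ACTIONS:
            return action, name, history  # task-level decision reached
        observation = EPISTEMIC_ACTIONS[action](by_name[name])
        history.append((action, name, observation))
    return "give_up", "", history


scene = [
    HiddenObject("block_a", "red", "plastic", "light"),
    HiddenObject("block_b", "red", "metal", "heavy"),
]
action, target, history = interactive_perception(["red", "metal"], scene)
print(action, target)  # prints: pick_up block_b
```

In this sketch vision alone cannot separate the two red blocks, so the loop has to knock on them before it can commit to a task-level action. Replacing `stub_llm` with a prompt to an actual language model, and the lambdas with real perception modules, is where the grounding problem discussed in the abstract would arise.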