Embodied Executable Policy Learning with Language-based Scene Summarization

Large language models (LLMs) have shown remarkable success in assisting robot learning tasks, e.g., complex household planning. However, the performance of pretrained LLMs relies heavily on domain-specific templated text data, which may be infeasible to obtain in real-world robot learning tasks with image-based observations. Moreover, existing LLMs with text inputs cannot evolve through non-expert interactions with the environment. In this work, we introduce a novel learning paradigm that generates a robot's executable actions in the form of text, derived solely from visual observations, using language-based summarization of these observations as the bridge between the two domains. Our paradigm stands apart from previous works, which use either language instructions or a combination of language and visual data as inputs. Moreover, our method does not require oracle text summaries of the scene, eliminating the need for human involvement in the learning loop and making it more practical for real-world robot learning tasks. The paradigm consists of two modules: the SUM module, which interprets the environment from visual observations and produces a text summary of the scene, and the APM module, which generates executable action policies from the natural language descriptions provided by the SUM module. We demonstrate that our method can be fine-tuned with either of two strategies, imitation learning or reinforcement learning, to adapt effectively to the target test tasks. We conduct extensive experiments covering various SUM/APM model selections, environments, and tasks across 7 house layouts in the VirtualHome environment. The results show that our method surpasses existing baselines, confirming the effectiveness of this learning paradigm.
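To make the two-module structure concrete, the following is a minimal sketch of how a SUM stage and an APM stage might be chained, so that the only input is a visual observation and the only output is action text. Everything here is an illustrative assumption rather than the paper's implementation: the `TwoStagePolicy` class, the `toy_sum` and `toy_apm` stand-ins, the pixel-threshold rule, and the action strings are all hypothetical, and real SUM/APM modules would be learned models.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical stand-in for an image: a nested sequence of pixel intensities.
Image = Sequence[Sequence[int]]


@dataclass
class TwoStagePolicy:
    """Chains a scene summarizer (SUM) with an action policy module (APM)."""

    summarize: Callable[[Image], str]  # SUM: visual observation -> text summary
    plan: Callable[[str], str]         # APM: text summary -> action text

    def act(self, observation: Image) -> str:
        # The text summary is the only information passed between the stages.
        summary = self.summarize(observation)
        return self.plan(summary)


def toy_sum(observation: Image) -> str:
    """Toy SUM (assumption): reports an apple if any pixel exceeds 200."""
    bright = any(pixel > 200 for row in observation for pixel in row)
    return "an apple is on the table" if bright else "the table is empty"


def toy_apm(summary: str) -> str:
    """Toy APM (assumption): maps a summary to an illustrative action string."""
    if "apple" in summary:
        return "[grab] <apple>"
    return "[walk] <kitchen>"


policy = TwoStagePolicy(summarize=toy_sum, plan=toy_apm)
print(policy.act([[0, 255], [10, 20]]))
```

Because the two stages communicate only through text, either one could in principle be swapped or fine-tuned independently, which is the property the paradigm relies on.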
