Language models (LMs) have been shown to possess commonsense knowledge of the physical world, which is crucial for performing everyday tasks. However, it remains unclear whether they can generate grounded, executable plans for embodied tasks. This is challenging because LMs cannot perceive the environment through vision or receive feedback from the physical environment. In this paper, we address this research question and present the first investigation into the topic. Our novel problem formulation, named G-PlanET, takes as input a high-level goal and a data table describing the objects in a specific environment, and outputs a step-by-step actionable plan for a robotic agent to follow. To facilitate the study, we establish an evaluation protocol and design a dedicated metric, KAS, to assess the quality of the plans. Our experiments demonstrate that encoding the environment as tables and using an iterative decoding strategy can significantly enhance LMs' ability in grounded planning. Our analysis also reveals interesting and non-trivial findings.
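To make the formulation concrete, the following is a minimal sketch of the input/output contract described above: an object table is flattened into text, and a plan is built one step at a time. It is an illustration under stated assumptions, not the paper's implementation. The names `linearize_table`, `iterative_plan`, `next_step`, and the `END` stop marker are hypothetical, as are the pipe-separated table layout and the reading of "iterative decoding" as step-by-step generation conditioned on the steps produced so far.

```python
from typing import Callable, Dict, List


def linearize_table(rows: List[Dict[str, str]]) -> str:
    """Flatten an object table into pipe-separated text, one row per line.

    The layout is an assumption made for illustration only.
    """
    if not rows:
        return ""
    columns = list(rows[0].keys())
    header = " | ".join(columns)
    body = [" | ".join(str(row[c]) for c in columns) for row in rows]
    return "\n".join([header] + body)


def iterative_plan(
    goal: str,
    rows: List[Dict[str, str]],
    next_step: Callable[[str], str],
    max_steps: int = 10,
    stop_token: str = "END",
) -> List[str]:
    """Build a plan one step at a time.

    `next_step` is a hypothetical stand-in for a language model call: it
    receives the goal, the linearized table, and the steps so far, and
    returns the next step or `stop_token` when the plan is complete.
    """
    table_text = linearize_table(rows)
    steps: List[str] = []
    for _ in range(max_steps):
        prompt = (
            f"Goal: {goal}\n"
            f"Objects:\n{table_text}\n"
            "Plan so far:\n" + "\n".join(steps) + "\nNext step:"
        )
        step = next_step(prompt).strip()
        if not step or step == stop_token:
            break
        steps.append(step)
    return steps
```

In use, `next_step` would wrap whatever model is being evaluated; for a quick check it can be replaced by a scripted function that returns canned steps, which keeps the loop testable without any model.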