Physical Reasoning and Object Planning for Household Embodied Agents

We study task planning for robust household embodied agents, focusing on the problem of selecting substitute objects. We introduce the CommonSense Object Affordance Task (COAT), a framework for analyzing reasoning capabilities in commonsense scenarios. COAT centers on how such agents can identify and use alternative objects when executing household tasks, which offers insight into practical decision-making in real-world environments. Drawing inspiration from human decision-making, we examine how large language models handle this challenge using three commonsense question-answering datasets built with refined rules and human annotations. Our evaluation of state-of-the-art language models on these datasets highlights three considerations: 1) aligning an object's inherent utility with the task at hand, 2) navigating contextual dependencies (societal norms, safety, appropriateness, and efficiency), and 3) accounting for the current physical state of the object. To maintain accessibility, we introduce five abstract variables reflecting an object's physical condition, modulated by human insights to simulate diverse household scenarios. Our contributions include Object-Utility mappings that address the first consideration and two extensive QA datasets (15k and 130k questions) that probe contextual dependencies and object states. The datasets and our findings are available at: \url{https://github.com/com-phy-affordance/COAT}. This research advances the understanding of physical commonsense reasoning in language models and lays groundwork for future improvements in household agent intelligence.
