Accomplishing household tasks requires planning step-by-step actions while considering the consequences of previous actions. However, state-of-the-art embodied agents often make mistakes in navigating the environment and interacting with the proper objects due to imperfect learning from imitating experts or algorithmic planners without such knowledge. To improve both visual navigation and object interaction, we propose CAPEAM (Context-Aware Planning and Environment-Aware Memory), which accounts for the consequences of taken actions by incorporating semantic context (e.g., appropriate objects to interact with) into a sequence of actions, and by using the changed spatial arrangement and states of interacted objects (e.g., the location to which an object has been moved) when inferring subsequent actions. We empirically show that an agent with the proposed CAPEAM achieves state-of-the-art performance on various metrics by large margins on a challenging interactive instruction-following benchmark, in both seen and unseen environments (up to +10.70% in unseen environments).