Embodied Instruction Following (EIF) is the task of planning a long sequence of sub-goals from a high-level natural language instruction, such as "Rinse a slice of lettuce and place it on the white table next to the fork". To execute such long-horizon tasks successfully, we argue that an agent must consider its past, i.e., historical data, when making each decision. Nevertheless, recent EIF approaches often neglect the knowledge contained in historical data and do not effectively utilize information across modalities. To this end, we propose History-Aware Planning based on Fused Information (HAPFI), which effectively leverages the multi-modal historical data that agents collect while interacting with the environment. Specifically, HAPFI integrates multiple modalities, including historical RGB observations, bounding boxes, sub-goals, and high-level instructions, fusing them via our Mutually Attentive Fusion method. Through experiments with diverse baselines, we show that an agent utilizing historical multi-modal information surpasses all compared methods that neglect historical data in action planning capability, generating well-informed action plans for each next step. Moreover, we provide qualitative evidence highlighting the significance of leveraging historical multi-modal data, particularly in scenarios where the agent encounters intermediate failures, showcasing robust re-planning capabilities.
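The abstract does not spell out the internals of Mutually Attentive Fusion. As a rough illustration only, assuming it resembles standard bidirectional cross-attention between two modality streams (the function names, dimensions, and the mean-pool-and-concatenate fusion step below are all hypothetical, not taken from the paper), a minimal numpy sketch might look like this:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(q, kv):
    # scaled dot-product attention: queries from one modality,
    # keys/values from the other (single head, no learned projections)
    d = q.shape[-1]
    scores = q @ kv.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ kv

def mutually_attentive_fusion(vis, txt):
    # hypothetical fusion: each modality attends to the other, then the
    # two attended streams are mean-pooled and concatenated
    vis2txt = cross_attend(vis, txt)  # vision queries over language tokens
    txt2vis = cross_attend(txt, vis)  # language queries over vision tokens
    return np.concatenate([vis2txt.mean(axis=0), txt2vis.mean(axis=0)])

rng = np.random.default_rng(0)
vis = rng.normal(size=(5, 8))  # e.g. 5 historical RGB-frame features
txt = rng.normal(size=(7, 8))  # e.g. 7 instruction/sub-goal token features
fused = mutually_attentive_fusion(vis, txt)
print(fused.shape)  # (16,)
```

In practice the paper's method would use learned query/key/value projections and also fold in bounding-box and sub-goal histories; this sketch only conveys the bidirectional-attention idea.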