Embodied agents face significant challenges when acting in diverse environments, particularly in generalizing across object types and in selecting suitable actions to accomplish tasks. Agents should also be robust, minimizing the execution of illegal actions. In this work, we present Egocentric Planning, an approach that combines symbolic planning and Object-oriented POMDPs to solve tasks in complex environments, harnessing existing models for visual perception and natural language processing. We evaluated our approach in ALFRED, a simulated environment designed for domestic tasks, and demonstrated its high scalability, achieving a 36.07% unseen success rate in the ALFRED benchmark and winning the ALFRED challenge at the CVPR Embodied AI workshop. Our method requires reliable perception and the specification or learning of a symbolic description of the preconditions and effects of the agent's actions, as well as of which object types reveal information about others. It scales naturally to new tasks beyond ALFRED, as long as they can be solved using the available skills. This work offers a solid baseline for studying end-to-end and hybrid methods that aim to generalize to new tasks, including recent approaches that rely on LLMs but often struggle to scale to long sequences of actions or to produce robust plans for novel tasks.