Large Language Models (LLMs) and Visual Language Models (VLMs) are attracting increasing interest due to their improving performance and their applications across many domains and tasks. However, LLMs and VLMs can produce erroneous results, especially when a deep understanding of the problem domain is required. For instance, when planning and perception are needed simultaneously, these models often struggle because they have difficulty merging multi-modal information. To address this issue, models are typically fine-tuned on specialized data structures that represent the environment. This approach has limited effectiveness, as it can overly complicate the context the model has to process. In this paper, we propose a multi-agent architecture for embodied task planning that does not require specific data structures as input. Instead, it uses a single image of the environment and handles free-form domains by leveraging commonsense knowledge. We also introduce a novel, fully automatic evaluation procedure, PG2S, designed to better assess the quality of a plan. We validate our approach on the widely used ALFRED dataset and compare PG2S with the existing KAS metric to further evaluate the quality of the generated plans.