Owing to recent advancements, Large Language Models (LLMs) can now be deployed as agents for increasingly complex decision-making applications in areas including robotics, gaming, and API integration. However, incorporating past experiences into current decision-making, an innate human behavior, continues to pose significant challenges. To address this, we propose the Retrieval-Augmented Planning (RAP) framework, designed to dynamically leverage past experiences corresponding to the current situation and context, thereby enhancing agents' planning capabilities. RAP distinguishes itself by being versatile: it excels in both text-only and multimodal environments, making it suitable for a wide range of tasks. Empirical evaluations demonstrate RAP's effectiveness: it achieves state-of-the-art (SOTA) performance in textual scenarios and notably enhances multimodal LLM agents' performance on embodied tasks. These results highlight RAP's potential in advancing the functionality and applicability of LLM agents in complex, real-world applications.