Large language models (LLMs) have emerged as effective action policies for sequential decision-making (SDM) tasks due to their extensive prior knowledge. However, this knowledge is broad but general, and is often insufficient for a specific decision-making task with limited task-related data, making it challenging to adapt LLMs to such tasks efficiently. To address this challenge, we propose a memory-driven self-improvement framework that combines the LLM's general prior knowledge with a compact memory of domain-specific experiences. The memory retains past interactions and their associated Q-values, capturing decision-relevant knowledge that supports accurate value estimation and informs refinement of the LLM prior. The refined prior, in turn, generates higher-reward trajectories that further enrich the memory, forming a natural self-improvement loop in which the memory and the LLM prior mutually reinforce each other. Experiments show that our memory-driven approach significantly outperforms both traditional RL and LLM-based baselines, e.g., improving performance by over 40\% on in-distribution tasks and by over 75\% when generalizing to unseen tasks in ALFWorld.
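To make the interplay between the memory and the LLM prior concrete, the following minimal Python sketch illustrates one plausible reading of the loop described above. The \texttt{ExperienceMemory} class, the \texttt{llm\_policy.act} and \texttt{env} interfaces, the Monte-Carlo Q-value estimate, and the token-overlap retrieval are illustrative assumptions for exposition only, not the paper's actual implementation.

\begin{verbatim}
from dataclasses import dataclass

@dataclass
class Transition:
    state: str
    action: str
    q_value: float


class ExperienceMemory:
    """Compact store of past interactions and their estimated Q-values."""

    def __init__(self):
        self.entries = []

    def add_trajectory(self, trajectory, rewards, gamma=0.99):
        # Assumption: use the discounted Monte-Carlo return as a simple
        # Q-value estimate for each (state, action) pair in the trajectory.
        ret = 0.0
        for (state, action), reward in zip(reversed(trajectory), reversed(rewards)):
            ret = reward + gamma * ret
            self.entries.append(Transition(state, action, ret))

    def retrieve(self, state, k=5):
        # Toy retrieval: rank stored experiences by token overlap with the
        # current state (a stand-in for a learned or embedding-based retriever).
        def overlap(entry):
            return len(set(state.split()) & set(entry.state.split()))
        return sorted(self.entries, key=overlap, reverse=True)[:k]


def self_improvement_loop(llm_policy, env, memory, num_iterations=10):
    """Alternate between acting with the memory-informed LLM prior and
    enriching the memory with the newly collected trajectory."""
    for _ in range(num_iterations):
        state, done = env.reset(), False
        trajectory, rewards = [], []
        while not done:
            hints = memory.retrieve(state)          # decision-relevant experiences
            action = llm_policy.act(state, hints)   # LLM prior conditioned on memory
            next_state, reward, done = env.step(action)
            trajectory.append((state, action))
            rewards.append(reward)
            state = next_state
        memory.add_trajectory(trajectory, rewards)  # memory and prior reinforce each other
\end{verbatim}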