Large Language Models (LLMs) exhibit robust problem-solving capabilities across diverse tasks. However, most LLM-based agents are designed as specific task solvers with sophisticated prompt engineering, rather than as agents capable of learning and evolving through interaction. These task solvers require manually crafted prompts to convey task rules and regulate LLM behaviors, which inherently leaves them unable to handle complex, dynamic scenarios, e.g., large interactive games. In light of this, we propose Agent-Pro: an LLM-based Agent with Policy-level Reflection and Optimization that can learn a wealth of expertise from interactive experience and progressively elevate its behavioral policy. Specifically, it involves a dynamic belief-generation and reflection process for policy evolution. Rather than reflecting at the action level, Agent-Pro iteratively reflects on past trajectories and beliefs, fine-tuning its irrational beliefs toward a better policy. Moreover, a depth-first search is employed for policy optimization, ensuring continual improvement in policy payoffs. Agent-Pro is evaluated on two games, Blackjack and Texas Hold'em, where it outperforms vanilla LLMs and specialized models. Our results show that Agent-Pro can learn and evolve in complex, dynamic scenarios, which also benefits numerous LLM-based applications.
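To make the described loop concrete, below is a minimal Python sketch of the belief-generation, policy-level reflection, and depth-first policy search described above. It assumes a text-completion callable `llm` and a game environment `env` with `reset`/`step` methods; all function names and prompts are illustrative placeholders, not the paper's implementation.

```python
def play_episodes(policy, env, llm, n=10):
    """Roll out `policy` for n episodes; return mean payoff and trajectories."""
    payoffs, trajectories = [], []
    for _ in range(n):
        obs, done, traj, payoff = env.reset(), False, [], 0.0
        while not done:
            # Dynamic belief generation: the agent verbalizes beliefs about
            # itself and its opponents before choosing an action.
            belief = llm(f"Policy: {policy}\nObservation: {obs}\n"
                         "State your beliefs about your hand and the opponents.")
            action = llm(f"Beliefs: {belief}\nChoose a legal action.")
            obs, payoff, done = env.step(action)
            traj.append((obs, belief, action))
        payoffs.append(payoff)
        trajectories.append(traj)
    return sum(payoffs) / n, trajectories


def evolve_policy(policy, env, llm, depth=2, branch=2):
    """Depth-first search over LLM-proposed policy revisions: a candidate
    branch is explored further only if it raises the evaluated payoff."""
    base_payoff, trajectories = play_episodes(policy, env, llm)
    best_policy, best_payoff = policy, base_payoff
    if depth == 0:
        return best_policy, best_payoff
    for _ in range(branch):
        # Policy-level reflection: critique beliefs over whole trajectories,
        # not individual actions, and emit a revised behavioral policy.
        candidate = llm(
            f"Current policy: {policy}\nPast trajectories: {trajectories}\n"
            "Identify irrational beliefs and output a revised policy.")
        cand_payoff, _ = play_episodes(candidate, env, llm)
        if cand_payoff > base_payoff:  # prune non-improving branches
            deeper, deeper_payoff = evolve_policy(candidate, env, llm,
                                                  depth - 1, branch)
            if deeper_payoff > best_payoff:
                best_policy, best_payoff = deeper, deeper_payoff
    return best_policy, best_payoff
```

The payoff check before recursing is what makes the search depth-first with pruning: a revised policy is only expanded into further revisions when it already beats its parent on evaluated games, so payoffs are non-decreasing along every retained branch.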