LLM-powered agents face a persistent challenge: learning from their execution experiences to improve future performance. While agents can successfully complete many tasks, they often repeat inefficient patterns, fail to recover from similar errors, and miss opportunities to apply successful strategies from past executions. We present a novel framework for automatically extracting actionable learnings from agent execution trajectories and utilizing them to improve future performance through contextual memory retrieval. Our approach comprises four components: (1) a Trajectory Intelligence Extractor that performs semantic analysis of agent reasoning patterns, (2) a Decision Attribution Analyzer that identifies which decisions and reasoning steps led to failures, recoveries, or inefficiencies, (3) a Contextual Learning Generator that produces three types of guidance -- strategy tips from successful patterns, recovery tips from failure handling, and optimization tips from inefficient but successful executions, and (4) an Adaptive Memory Retrieval System that injects relevant learnings into agent prompts based on multi-dimensional similarity. Unlike existing memory systems that store generic conversational facts, our framework understands execution patterns, extracts structured learnings with provenance, and retrieves guidance tailored to specific task contexts. Evaluation on the AppWorld benchmark demonstrates consistent improvements, with up to 14.3 percentage point gains in scenario goal completion on held-out tasks and particularly strong benefits on complex tasks (28.5~pp scenario goal improvement, a 149\% relative increase).
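To make component (4) concrete, the following is a minimal sketch of how an adaptive memory retrieval step might score stored learnings against a task context along several similarity dimensions and inject the top matches into a prompt. All names, the dimension set (`task`, `app`), the weights, and the example learnings are illustrative assumptions, not the framework's actual implementation.

```python
import math

def cosine(a, b):
    # Standard cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical stored learnings with provenance-style type labels
# (strategy / recovery / optimization, mirroring the three tip types).
LEARNINGS = [
    {"type": "strategy",
     "text": "Paginate API results before filtering.",
     "embedding": {"task": [0.9, 0.1], "app": [1.0, 0.0]}},
    {"type": "recovery",
     "text": "On authentication errors, re-authenticate and retry once.",
     "embedding": {"task": [0.2, 0.8], "app": [0.0, 1.0]}},
]

# Assumed per-dimension weights for the multi-dimensional score.
WEIGHTS = {"task": 0.7, "app": 0.3}

def retrieve(context_emb, learnings, k=1):
    # Weighted sum of per-dimension cosine similarities, top-k by score.
    def score(learning):
        return sum(WEIGHTS[d] * cosine(context_emb[d], learning["embedding"][d])
                   for d in WEIGHTS)
    return sorted(learnings, key=score, reverse=True)[:k]

# A task context close to the pagination scenario retrieves the strategy tip,
# which is then appended to the agent prompt as contextual guidance.
hits = retrieve({"task": [0.85, 0.2], "app": [0.9, 0.1]}, LEARNINGS)
prompt_suffix = "\n".join(f"[{h['type']}] {h['text']}" for h in hits)
```

In a real system the embeddings would come from a sentence encoder and the dimensions might include error patterns or app identities; the point here is only the weighted multi-dimensional scoring and prompt injection.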