Large language model (LLM) agents rely on specialized memory systems to accumulate and reuse knowledge over extended interactions. Recent architectures typically adopt a fixed memory design tailored to a specific domain, such as semantic retrieval for conversation or skill reuse for coding. However, a memory system optimized for one purpose often fails to transfer to others. To address this limitation, we introduce M$^\star$, a method that automatically discovers task-optimized memory harnesses through executable program evolution. Specifically, M$^\star$ models an agent memory system as a memory program written in Python that encapsulates the data Schema, the storage Logic, and the agent workflow Instructions. We optimize these components jointly with a reflective code evolution method that uses population-based search and analyzes evaluation failures to iteratively refine candidate programs. We evaluate M$^\star$ on four benchmarks spanning conversation, embodied planning, and expert reasoning. M$^\star$ consistently improves over existing fixed-memory baselines on all evaluated tasks. Furthermore, the evolved memory programs exhibit structurally distinct processing mechanisms for each domain. This finding indicates that specializing the memory mechanism for a given task exploits a broad design space and yields a better solution than general-purpose memory paradigms.
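To make the idea of a memory program and its failure-driven refinement concrete, the following is a minimal toy sketch, not the M$^\star$ implementation. All names (`MemoryProgram`, `evaluate`, `reflect`, `evolve`, the `top_k` parameter, and the toy events and queries) are hypothetical. In particular, the reflection step, which the abstract describes only as analyzing evaluation failures, is replaced here by a hand-written rule that maps a failure reason to a program change, and the search keeps a small population of the best-scoring candidates.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MemoryProgram:
    """Hypothetical container for the three components named in the abstract."""
    schema: tuple        # Schema: which event fields a stored record keeps
    top_k: int           # Logic: how many matching records retrieval returns
    instructions: str    # Instructions: workflow text for the agent (unused in this toy)

    def write(self, store, event):
        store.append({k: event[k] for k in self.schema if k in event})

    def read(self, store, entity):
        hits = [r for r in store if r.get("entity") == entity]
        return hits[max(0, len(hits) - self.top_k):]  # most recent matches only


def evaluate(program, events, queries):
    """Return (accuracy, failures); each failure records why a query was missed."""
    store = []
    for event in events:
        program.write(store, event)
    failures = []
    for entity, expected in queries:
        facts = [r.get("fact") for r in program.read(store, entity)]
        if expected not in facts:
            reason = "fact_not_stored" if "fact" not in program.schema else "fact_cut_off"
            failures.append((entity, reason))
    return 1 - len(failures) / len(queries), failures


def reflect(program, failures):
    """Assumed stand-in for reflection: turn a failure reason into a revised program."""
    reasons = {reason for _, reason in failures}
    if "fact_not_stored" in reasons:
        return replace(program, schema=program.schema + ("fact",))
    if "fact_cut_off" in reasons:
        return replace(program, top_k=program.top_k + 1)
    return program  # nothing failed, nothing to change


def evolve(seeds, events, queries, generations=5, keep=2):
    """Population-based loop: score, keep the best, add reflected children."""
    population = list(seeds)
    for _ in range(generations):
        scored = sorted(
            ((evaluate(p, events, queries), p) for p in population),
            key=lambda item: item[0][0],
            reverse=True,
        )
        survivors = scored[:keep]
        children = [reflect(p, fails) for (_, fails), p in survivors]
        population = [p for _, p in survivors] + children
    return max(population, key=lambda p: evaluate(p, events, queries)[0])


# Toy task: remember facts about entities and answer lookups.
events = [
    {"entity": "alice", "fact": "likes tea"},
    {"entity": "alice", "fact": "moved to Oslo"},
    {"entity": "bob", "fact": "owns a bike"},
    {"entity": "alice", "fact": "adopted a cat"},
]
queries = [("alice", "likes tea"), ("bob", "owns a bike"), ("alice", "adopted a cat")]

seed = MemoryProgram(schema=("entity",), top_k=1,
                     instructions="Look up the entity before answering.")
best = evolve([seed], events, queries)
print(best.schema, best.top_k, evaluate(best, events, queries)[0])
```

In this sketch the seed program stores no facts at all, so the first reflection extends the Schema, and later reflections widen retrieval until every query is answered. A different toy task would push the same loop toward a different program, which is the intuition behind task-specific memory designs.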