LLM agents increasingly operate in open-ended environments spanning hundreds of sequential episodes, yet they remain largely stateless: each task is solved from scratch without converting past experience into better future behavior. The central obstacle is not \emph{what} to remember but \emph{how to use} what has been remembered, including which retrieval policy to apply, how to interpret prior outcomes, and when the current strategy itself must change. We introduce \emph{Agent Evolving Learning} (\ael{}), a two-timescale framework that addresses this obstacle. At the fast timescale, a Thompson Sampling bandit learns which memory retrieval policy to apply at each episode; at the slow timescale, LLM-driven reflection diagnoses failure patterns and injects causal insights into the agent's decision prompt, giving it an interpretive frame for the evidence it retrieves. On a sequential portfolio benchmark (10 sector-diverse tickers, 208 episodes, 5 random seeds), \ael{} achieves a Sharpe ratio of 2.13$\pm$0.47, outperforming five published self-improving methods and all non-LLM baselines while maintaining the lowest variance among all LLM-based approaches. A nine-variant ablation reveals a ``less is more'' pattern: memory and reflection together produce a 58\% cumulative improvement over the stateless baseline, yet every additional mechanism we test (planner evolution, per-tool selection, cold-start initialization, skill extraction, and three credit assignment methods) \emph{degrades} performance. This demonstrates that the bottleneck in agent self-improvement is \emph{self-diagnosing how to use} experience rather than adding architectural complexity. Code and data: https://github.com/WujiangXu/AEL.
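The fast-timescale component described above, a Thompson Sampling bandit that picks a memory retrieval policy at each episode, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the \ael{} code release: the policy names (`no_memory`, `recent_k`, `similar_k`), the binary per-episode reward, the Beta(1, 1) priors, and the simulated success rates are all hypothetical. The actual reward signal and policy set used by \ael{} may differ.

```python
import random


class RetrievalPolicyBandit:
    """Beta-Bernoulli Thompson Sampling over named retrieval policies (illustrative)."""

    def __init__(self, policies, rng=None):
        self.policies = list(policies)
        self.rng = rng or random.Random()
        # Assumed Beta(1, 1) prior per arm: alpha tracks successes, beta tracks failures.
        self.alpha = {p: 1.0 for p in self.policies}
        self.beta = {p: 1.0 for p in self.policies}

    def select(self):
        # Draw one sample from each arm's posterior and play the arm with the largest draw.
        draws = {p: self.rng.betavariate(self.alpha[p], self.beta[p])
                 for p in self.policies}
        return max(draws, key=draws.get)

    def update(self, policy, success):
        # Conjugate update for a binary outcome (an assumed reward model).
        if success:
            self.alpha[policy] += 1.0
        else:
            self.beta[policy] += 1.0

    def posterior_mean(self, policy):
        return self.alpha[policy] / (self.alpha[policy] + self.beta[policy])


def simulate(n_episodes=208, seed=0):
    """Run the bandit against made-up success rates; a toy stand-in for real episodes."""
    rng = random.Random(seed)
    # Hypothetical per-policy success probabilities, invented for this demo only.
    true_rates = {"no_memory": 0.35, "recent_k": 0.50, "similar_k": 0.65}
    bandit = RetrievalPolicyBandit(true_rates, rng=rng)
    for _ in range(n_episodes):
        policy = bandit.select()
        success = rng.random() < true_rates[policy]
        bandit.update(policy, success)
    return bandit


if __name__ == "__main__":
    bandit = simulate()
    for name in bandit.policies:
        pulls = bandit.alpha[name] + bandit.beta[name] - 2.0
        print(f"{name}: pulls={pulls:.0f}, posterior mean={bandit.posterior_mean(name):.2f}")
```

In this toy setting the sampler tends to concentrate its pulls on the arm with the highest simulated success rate while still occasionally exploring the others. A continuous reward such as a per-episode return would need a different posterior (for example a Gaussian model) in place of the Beta-Bernoulli update assumed here.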