APEX-EM: Non-Parametric Online Learning for Autonomous Agents via Structured Procedural-Episodic Experience Replay

LLM-based autonomous agents lack persistent procedural memory: they re-derive solutions from scratch even when structurally identical tasks have been solved before. We present \textbf{APEX-EM}, a non-parametric online learning framework that accumulates, retrieves, and reuses structured procedural plans without modifying model weights. APEX-EM introduces: (1) a \emph{structured experience representation} encoding the full procedural-episodic trace of each execution: planning steps, artifacts, iteration history with error analysis, and quality scores; (2) a \emph{Plan-Retrieve-Generate-Iterate-Ingest} (PRGII) workflow in which Task Verifiers provide multi-dimensional reward signals; and (3) a \emph{dual-outcome Experience Memory} with hybrid retrieval that combines semantic search, structural signature matching, and plan DAG traversal, enabling cross-domain transfer between tasks that share no lexical overlap but have analogous operational structure. Successful experiences serve as positive in-context examples; failures serve as negative examples with structured error annotations. We evaluate on BigCodeBench~\cite{zhuo2025bigcodebench}, KGQAGen-10k~\cite{zhang2025kgqagen}, and Humanity's Last Exam (HLE)~\cite{phan2025hle} using Claude Sonnet 4.5 and Opus 4.5. On KGQAGen-10k, APEX-EM achieves 89.6\% accuracy versus 41.3\% without memory (+48.3pp), surpassing the oracle-retrieval upper bound (84.9\%). On BigCodeBench, it reaches 83.3\% SR from a 53.9\% baseline (+29.4pp), exceeding the +11.0pp gain of MemRL~\cite{memrl2025} under comparable frozen-backbone conditions (backbone differences are controlled for in our analysis). On HLE, entity graph retrieval reaches 48.0\% from 25.2\% (+22.8pp). Ablations show that component value is task-dependent: rich judge feedback is negligible for code generation but critical for structured queries (+10.3pp), while binary-signal iteration partially compensates for weaker feedback.
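To make the dual-outcome memory and hybrid retrieval concrete, the following is a minimal sketch under stated assumptions, not the APEX-EM implementation. The `Experience` fields, the tag-set form of the structural signature, the equal weights, and the helper names are all hypothetical. Token-overlap Jaccard similarity stands in for semantic search, and plan DAG traversal is omitted entirely. The sketch shows how a structural term can surface a stored plan for a query that shares no words with it, and how successes and failures are returned separately as positive and negative examples.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Experience:
    """One stored execution trace. All field names are illustrative assumptions."""
    task: str
    signature: frozenset[str]  # abstract operation tags (hypothetical encoding)
    plan: list[str]
    success: bool
    quality: float = 0.0
    error_notes: list[str] = field(default_factory=list)


def jaccard(a, b) -> float:
    """Set overlap in [0, 1]; defined as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hybrid_score(query_task, query_sig, exp, w_text=0.5, w_struct=0.5) -> float:
    """Weighted sum of a lexical term and a structural term (weights assumed)."""
    # Token overlap is a crude stand-in for embedding-based semantic search.
    text_sim = jaccard(query_task.lower().split(), exp.task.lower().split())
    struct_sim = jaccard(query_sig, exp.signature)
    return w_text * text_sim + w_struct * struct_sim


def retrieve(memory, query_task, query_sig, k=1):
    """Return the top-k successful and top-k failed experiences separately."""
    ranked = sorted(
        memory,
        key=lambda e: hybrid_score(query_task, query_sig, e),
        reverse=True,
    )
    positives = [e for e in ranked if e.success][:k]
    negatives = [e for e in ranked if not e.success][:k]
    return positives, negatives


# Toy memory: none of the stored tasks shares a word with the query below.
memory = [
    Experience(
        "list employees hired after 2020 with their department names",
        frozenset({"filter", "join", "project"}),
        ["filter rows by date", "join to lookup table", "select columns"],
        success=True,
        quality=0.9,
    ),
    Experience(
        "compute average rainfall per city",
        frozenset({"group", "aggregate"}),
        ["group by key", "average values"],
        success=True,
        quality=0.8,
    ),
    Experience(
        "find papers published since 2019 and their venue titles",
        frozenset({"filter", "join"}),
        ["join first", "filter afterwards"],
        success=False,
        error_notes=["join before filter timed out"],
    ),
]

query = "which films released before 1990 won awards"
query_sig = frozenset({"filter", "join", "project"})
positives, negatives = retrieve(memory, query, query_sig)

print(positives[0].plan)         # plan reused as a positive in-context example
print(negatives[0].error_notes)  # annotation attached to the negative example
```

In this toy memory the lexical term is zero for every stored task, so the ranking is driven by the structural signature alone. A purely lexical retriever would have no basis for preferring the employee-listing plan over the rainfall one.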
