Knowledge Access Beats Model Size: Memory Augmented Routing for Persistent AI Agents

Production AI agents frequently receive user-specific queries that are highly repetitive, with up to 47% being semantically similar to prior interactions, yet each query is typically processed at the same computational cost. We argue that this redundancy can be exploited through conversational memory, transforming repetition from a cost burden into an efficiency advantage. We propose a memory-augmented inference framework in which a lightweight 8B-parameter model leverages retrieved conversational context to answer all queries via a low-cost inference path. Without any additional training or labeled data, this approach achieves 30.5% F1, recovering 69% of the performance of a full-context 235B model while reducing effective cost by 96%. Notably, a 235B model without memory (13.7% F1) underperforms even the standalone 8B model (15.4% F1), indicating that for user-specific queries, access to relevant knowledge outweighs model scale. We further analyze the role of routing and confidence. At practical confidence thresholds, routing alone already directs 96% of queries to the small model, but yields poor accuracy (13.0% F1) due to confident hallucinations. Memory does not substantially alter routing decisions; instead, it improves correctness by grounding responses in retrieved user-specific information. As conversational memory accumulates over time, coverage of recurring topics increases, further narrowing the performance gap. We evaluate on 152 LoCoMo questions (Qwen3-8B/235B) and 500 LongMemEval questions. Incorporating hybrid retrieval (BM25 + cosine similarity) improves performance by an additional 7.7 F1 points, demonstrating that retrieval quality directly enhances end-to-end system performance. Overall, our results highlight that memory, rather than model size, is the primary driver of accuracy and efficiency in persistent AI agents.
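To make the hybrid retrieval and confidence routing described above more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not say how BM25 and cosine scores are fused, so the weighted sum with `alpha`, the BM25 constants `k1` and `b`, the routing `threshold`, and all function names are assumptions made for illustration. Cosine similarity is computed over bag-of-words counts as a stand-in for the embedding similarity a real system would likely use.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase word tokenizer; a real system would likely use something richer.
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    # Okapi BM25 score of every document for the query (k1, b are common defaults).
    toks = [tokenize(d) for d in docs]
    n = len(toks)
    avgdl = (sum(len(t) for t in toks) / n) or 1.0
    df = Counter(term for t in toks for term in set(t))
    q_terms = set(tokenize(query))
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for term in q_terms:
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(t) / avgdl)
            s += idf * tf[term] * (k1 + 1) / norm
        scores.append(s)
    return scores


def cosine(a, b):
    # Cosine similarity over term counts (assumed stand-in for embeddings).
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0


def hybrid_retrieve(query, memory, k=2, alpha=0.5):
    # Hypothetical fusion: alpha * (BM25 scaled by its max) + (1 - alpha) * cosine.
    if not memory:
        return []
    bm25 = bm25_scores(query, memory)
    top = max(bm25) or 1.0
    scored = [
        (alpha * s / top + (1 - alpha) * cosine(query, m), m)
        for s, m in zip(bm25, memory)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:k]]


def route(confidence, threshold=0.7):
    # Hypothetical rule: keep confident queries on the small model, escalate the rest.
    return "small" if confidence >= threshold else "large"


memory = [
    "User's dog is named Biscuit and loves the beach.",
    "User prefers vegetarian recipes.",
    "User works as a nurse on night shifts.",
]
context = hybrid_retrieve("What is my dog's name?", memory, k=1)
```

In this sketch the retrieved `context` would be prepended to the small model's prompt, which mirrors the abstract's point that memory improves correctness by grounding the answer rather than by changing where `route` sends the query.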

