Production AI agents frequently receive user-specific queries that are highly repetitive, with up to 47\% being semantically similar to prior interactions, yet each query is typically processed with the same computational cost. We argue that this redundancy can be exploited through conversational memory, transforming repetition from a cost burden into an efficiency advantage. We propose a memory-augmented inference framework in which a lightweight 8B-parameter model leverages retrieved conversational context to answer all queries via a low-cost inference path. Without any additional training or labeled data, this approach achieves 30.5\% F1, recovering 69\% of the performance of a full-context 235B model while reducing effective cost by 96\%. Notably, a 235B model without memory (13.7\% F1) underperforms even the standalone 8B model (15.4\% F1), indicating that for user-specific queries, access to relevant knowledge outweighs model scale. We further analyze the role of routing and confidence. At practical confidence thresholds, routing alone already directs 96\% of queries to the small model, but yields poor accuracy (13.0\% F1) due to confident hallucinations. Memory does not substantially alter routing decisions; instead, it improves correctness by grounding responses in retrieved user-specific information. As conversational memory accumulates over time, coverage of recurring topics increases, further narrowing the performance gap. We evaluate on 152 LoCoMo questions (Qwen3-8B/235B) and 500 LongMemEval questions. Incorporating hybrid retrieval (BM25 + cosine similarity) improves performance by an additional +7.7 F1, demonstrating that retrieval quality directly enhances end-to-end system performance. Overall, our results highlight that memory, rather than model size, is the primary driver of accuracy and efficiency in persistent AI agents.
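The hybrid retrieval step mentioned in the abstract (BM25 combined with cosine similarity) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: the function names, the equal-weight fusion parameter `alpha`, the min-max normalization, and the use of bag-of-words term-frequency vectors in place of learned embeddings for the cosine component. The retrieved memory entries are then prepended to the query that the small model would receive.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each document against the query."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n if n else 0.0
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine_scores(query, docs):
    """Cosine similarity on term-frequency vectors.

    Assumption: a stand-in for embedding-based similarity.
    """
    q = Counter(tokenize(query))
    q_norm = math.sqrt(sum(v * v for v in q.values()))
    scores = []
    for d in docs:
        v = Counter(tokenize(d))
        d_norm = math.sqrt(sum(x * x for x in v.values()))
        dot = sum(q[t] * v[t] for t in q)
        scores.append(dot / (q_norm * d_norm) if q_norm and d_norm else 0.0)
    return scores


def min_max(xs):
    """Rescale scores to [0, 1] so the two signals are comparable."""
    lo, hi = min(xs), max(xs)
    return [0.0] * len(xs) if hi == lo else [(x - lo) / (hi - lo) for x in xs]


def hybrid_retrieve(query, memory, top_k=2, alpha=0.5):
    """Return the top_k memory entries by fused lexical + cosine score."""
    if not memory:
        return []
    lexical = min_max(bm25_scores(query, memory))
    dense = min_max(cosine_scores(query, memory))
    fused = [alpha * l + (1 - alpha) * d for l, d in zip(lexical, dense)]
    ranked = sorted(range(len(memory)), key=lambda i: fused[i], reverse=True)
    return [memory[i] for i in ranked[:top_k]]


def build_prompt(query, memory, top_k=2):
    """Prepend retrieved memory to the query for the small model."""
    context = "\n".join(f"- {m}" for m in hybrid_retrieve(query, memory, top_k))
    return f"Relevant memory:\n{context}\n\nQuestion: {query}"


# Hypothetical conversational memory for one user.
memory = [
    "The user adopted a dog named Biscuit in March.",
    "The user prefers vegetarian recipes.",
    "The user is training for a marathon in October.",
]
print(build_prompt("What is the name of the dog the user adopted?", memory, top_k=1))
```

In this toy example the entry about the dog ranks first under both signals, so the prompt grounds the answer in user-specific information that neither model could know from its parameters alone. A real system would likely replace the term-frequency cosine with embedding similarity and tune `alpha` on held-out queries.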