Most evaluations of External Memory Modules assume a static setting: memory is built offline and queried at a fixed state. In practice, memory is streaming: new facts arrive continuously, insertions interleave with retrievals, and the memory state evolves while the model is serving queries. In this regime, accuracy and cost are governed by the full memory lifecycle, which spans the ingestion, maintenance, and retrieval of information and its integration into generation. We present Neuromem, a scalable testbed that benchmarks External Memory Modules under an interleaved insertion-and-retrieval protocol and decomposes the memory lifecycle into five dimensions: memory data structure, normalization strategy, consolidation policy, query formulation strategy, and context integration mechanism. Using three representative datasets, LOCOMO, LONGMEMEVAL, and MEMORYAGENTBENCH, Neuromem evaluates interchangeable variants of each dimension within a shared serving stack and reports token-level F1 and insertion/retrieval latency. Overall, we observe that performance typically degrades as memory grows across rounds, and that time-related queries remain the most challenging category. The memory data structure largely determines the attainable quality frontier, while aggressive compression and generative integration mechanisms mostly shift cost between insertion and retrieval and yield limited accuracy gains.
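The interleaved insertion-and-retrieval protocol and the two reported metrics can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `KeywordMemory`, `run_interleaved`, and the round format are hypothetical and are not Neuromem's API, and `token_f1` uses a common lowercased, whitespace-tokenized overlap definition that may differ from the exact scoring used in the benchmark.

```python
import time
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 over lowercased whitespace tokens (assumed definition)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


class KeywordMemory:
    """Hypothetical stand-in for an external memory module."""

    def __init__(self):
        self.facts = []

    def insert(self, fact: str) -> None:
        self.facts.append(fact)

    def retrieve(self, query: str) -> str:
        # Return the stored fact that shares the most tokens with the query.
        if not self.facts:
            return ""
        q = set(query.lower().split())
        return max(self.facts, key=lambda f: len(q & set(f.lower().split())))


def run_interleaved(memory, rounds):
    """Each round inserts new facts, then queries the memory as it stands."""
    results = []
    for facts, queries in rounds:
        t0 = time.perf_counter()
        for fact in facts:
            memory.insert(fact)
        insert_s = time.perf_counter() - t0

        t0 = time.perf_counter()
        answers = [memory.retrieve(q) for q, _ in queries]
        retrieve_s = time.perf_counter() - t0

        scores = [token_f1(a, gold) for a, (_, gold) in zip(answers, queries)]
        results.append({
            "f1": sum(scores) / len(scores) if scores else 0.0,
            "insert_s": insert_s,
            "retrieve_s": retrieve_s,
        })
    return results


# Toy rounds: (facts to insert, [(query, gold answer), ...]).
rounds = [
    (["alice lives in paris"], [("where does alice live", "alice lives in paris")]),
    (["bob lives in rome"], [("where does bob live", "bob lives in rome")]),
]
results = run_interleaved(KeywordMemory(), rounds)
```

Because the same memory object persists across rounds, later queries are answered against a larger store, which is the setting in which per-round F1 and latency can drift as memory grows.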