Every major benchmark for LLM memory systems, LoCoMo foremost, measures whether a model answered correctly, not whether the memory system retrieved correctly. A system returning its entire belief store achieves recall of 1.0 and passes answer-quality evaluation. This is the difference between a unit test and an integration test: retrieval quality must be measured in isolation from the generative model it feeds into, and no existing benchmark does this. We demonstrate that this failure persists even when entity extraction is entirely faithful. Memory baselines achieve mean retrieval precision of just 0.05 to 0.08 on cases referencing their own extractions. The failure is structural: cosine similarity over a domain-specific corpus cannot discriminate relevant beliefs from semantically proximate ones, an invariance confirmed across a 20x range in embedding model scale. Multi-turn evaluation surfaces a compounding failure; after topic drift, comparison systems allow semantic mass to bleed across turns, yielding high drift scores on re-entry. Single-turn metrics conceal this cost: Hindsight reports sub-700ms single-turn latency but exceeds 2,700ms mean per session turn, with p95 above 6,000ms. Under LLM-as-a-Judge evaluation, these failures remain invisible. We present two contributions: PrecisionMemBench, an 89-case benchmark measuring retrieval precision independently of generative models across diverse scope, mutation, and isolation assertions; and Tenure, a local-first structured belief store using multi-path BM25 with analyzer asymmetry, differential boosting, and hard scope isolation. Tenure passes 89/89 cases with mean precision 1.0 and sub-15ms retrieval latency. Comparison providers perform worse than the raw vector baseline they are built on, with zero active retrieval passes and ingestion costs of 98 to 897 seconds, failures that answer-quality benchmarks cannot detect.
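The recall-versus-precision gap described in the abstract can be illustrated with a minimal sketch. It assumes beliefs are identified by plain strings and that relevance is a known set per query. The function name `precision_recall`, the 25-item store, and the belief identifiers are hypothetical and chosen only for illustration; they are not taken from PrecisionMemBench or Tenure.

```python
from typing import Iterable, Tuple


def precision_recall(retrieved: Iterable[str], relevant: Iterable[str]) -> Tuple[float, float]:
    """Set-based retrieval precision and recall for a single query."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Precision: fraction of returned beliefs that are actually relevant.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: fraction of relevant beliefs that were returned.
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical belief store of 25 items, exactly one of which is relevant.
store = [f"belief-{i}" for i in range(25)]
relevant = {"belief-3"}

# A system that returns its entire store: perfect recall, very low precision.
dump_all = precision_recall(store, relevant)

# A system that returns only the relevant belief: perfect on both measures.
targeted = precision_recall(["belief-3"], relevant)

print("return-everything:", dump_all)
print("targeted:", targeted)
```

Both systems give a downstream model everything it needs to answer correctly, so an answer-quality metric alone would not separate them. Only the precision value, measured before generation, distinguishes the two.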