Less Context, More Accuracy: A Bi-Temporal Memory Engine for LLM Agents Where a Lean Retrieved Context Beats the Full History

Long-term memory is the missing layer for LLM agents: across sessions they forget, and the common workaround -- replaying the whole history into the prompt -- is expensive, slow, and, as distractors accumulate, less accurate. Most memory systems win on cost or latency but still lose to the full-context baseline on accuracy, and benchmark numbers are reported on inconsistent, non-reproducible harnesses, so one system appears at wildly different scores across sources. We present Engram, an open-source, dual-process memory engine on a bi-temporal data model. A fast write path appends lossless episodes with no LLM on the critical path; an asynchronous path extracts atomic (subject, predicate, object) facts, builds a bi-temporal knowledge graph, and resolves contradictions without an LLM call per fact -- invalidating, never deleting, so every fact keeps provenance and a supersession chain. A hybrid read path fuses dense, lexical, graph, and recency/salience signals, applies a point-in-time ("as-of") filter, and assembles a compact, provenance-tagged context. On the full 500-question LongMemEval_S, graded by the official category-specific judge, Engram's lean configuration -- answering from a ~9.6k-token retrieved slice, never the full history -- scores 83.6% vs. 73.2% for full-context (+10.4 points, McNemar p < 10^-6) at ~8x fewer tokens (9.6k vs. 79k), with 0/500 errored. The gain needs a hybrid read path: facts alone lose recall, while facts plus retrieved chunks recover detail. We also contribute a neutral, in-repo evaluation harness with the official judge baked in and the full-context baseline in every table, publish the raw per-question logs, and document the measurement-integrity pitfalls (truncation, home-grown judges, full-history leaks) that silently distort memory benchmarks. Every number ships with a command to reproduce it.
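To make the "invalidating, never deleting" rule and the point-in-time ("as-of") filter concrete, the following is a minimal sketch, not Engram's actual implementation. The names `Fact`, `FactStore`, `assert_fact`, and `as_of` are hypothetical, as is the contradiction rule used here (a new fact with the same subject and predicate but a different object supersedes the old one, i.e. predicates are assumed single-valued). The sketch also assumes that the as-of filter runs over the time the store recorded each fact.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Fact:
    # Atomic (subject, predicate, object) triple.
    subject: str
    predicate: str
    obj: str
    valid_from: datetime                       # when the fact became true in the world
    recorded_at: datetime                      # when the store learned the fact
    invalidated_at: Optional[datetime] = None  # when the store learned it was superseded
    superseded_by: Optional[int] = None        # supersession chain: id of the replacing fact
    source_episode: Optional[str] = None       # provenance (hypothetical episode id)


class FactStore:
    """Append-only store: contradicted facts are invalidated, never deleted."""

    def __init__(self) -> None:
        self.facts: List[Fact] = []

    def assert_fact(self, fact: Fact) -> int:
        new_id = len(self.facts)
        # Assumed rule-based contradiction check (no LLM call):
        # same subject and predicate, different object, still live.
        for old in self.facts:
            if (old.subject == fact.subject
                    and old.predicate == fact.predicate
                    and old.obj != fact.obj
                    and old.invalidated_at is None):
                old.invalidated_at = fact.recorded_at
                old.superseded_by = new_id
        self.facts.append(fact)
        return new_id

    def as_of(self, subject: str, predicate: str, when: datetime) -> List[Fact]:
        # Return the facts the store held as live at time `when`.
        return [
            f for f in self.facts
            if f.subject == subject
            and f.predicate == predicate
            and f.recorded_at <= when
            and (f.invalidated_at is None or f.invalidated_at > when)
        ]


# Usage: a later, contradicting fact supersedes the earlier one.
store = FactStore()
jan = datetime(2024, 1, 1)
jun = datetime(2024, 6, 1)
paris = store.assert_fact(Fact("alice", "lives_in", "Paris", jan, jan, source_episode="ep-1"))
berlin = store.assert_fact(Fact("alice", "lives_in", "Berlin", jun, jun, source_episode="ep-7"))

before = store.as_of("alice", "lives_in", datetime(2024, 3, 1))
after = store.as_of("alice", "lives_in", datetime(2024, 7, 1))
```

In this sketch both facts remain in `store.facts`; the older one carries `invalidated_at` and `superseded_by`, so a query as of March still sees "Paris" while a query as of July sees "Berlin".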
