Existing long-horizon memory benchmarks mostly use multi-turn dialogues or synthetic user histories, which makes retrieval performance an imperfect proxy for person understanding. We present \BenchName, a publicly releasable benchmark built from long-form autobiographical narratives, in which actions, context, and inner thoughts provide dense evidence for inferring stable motivations and decision principles. \BenchName~reconstructs each narrative into a flashback-aware, time-anchored stream and evaluates models with evidence-linked questions spanning factual recall, subjective state attribution, and principle-level reasoning. Across diverse narrative sources, retrieval-augmented systems mainly improve factual accuracy, while errors persist on temporally grounded explanations and higher-level inferences, highlighting the need for memory mechanisms beyond retrieval. Our data is available at \href{https://github.com/QuantaAlpha/KnowMeBench}{KnowMeBench}.