Conversational-memory systems increasingly transform dialogue history into facts, summaries, timelines, and other source-linked descendants, so a single source turn can coexist with several derived memories in the same retrieval index. This raises an underspecified evaluation question: which stored form should receive retrieval credit? We show that this scoring-target choice is often left implicit and can materially change benchmark conclusions. We present TIAP, a fixed-output audit that rescores saved ranked outputs under three targets -- Raw, Source, and Canonical -- without rerunning retrieval. On LoCoMo and LongMemEval-S, switching only the credited target changes nDCG on 83.4--94.0 percent of shared queries, flips target orderings on Mem0 and MemoryOS transfer runs, and reverses parser-density recommendations. A 1,902-case semantic audit further shows that relaxed source-linked credit is fully justified only 29.2 percent of the time, despite high rubric reliability in a validation subset. These results reveal target noninvariance: conclusions about memory architectures can silently flip with a single benchmark-design choice. Conversational-memory papers should therefore define and report the scoring target explicitly.
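As a rough illustration of the fixed-output idea, the sketch below rescores one saved ranking under three credit rules without touching the retriever. It is not the TIAP implementation. The target definitions are assumptions made for this example: Raw credits only the raw source turn, Source credits any item linked to a gold source turn, and Canonical credits one designated form per gold turn. The `Item` fields, the function names, the binary gains, and the once-per-source crediting are likewise hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One stored memory in a saved ranking (hypothetical schema)."""
    item_id: str
    source_id: str              # dialogue turn this item is, or derives from
    is_raw: bool = False        # True if the item is the raw turn itself
    is_canonical: bool = False  # assumed marker for the designated form


def credited(item, gold_sources, target):
    """Decide whether an item earns credit under the given target (assumed rules)."""
    if item.source_id not in gold_sources:
        return False
    if target == "raw":
        return item.is_raw
    if target == "source":
        return True
    if target == "canonical":
        return item.is_canonical
    raise ValueError(f"unknown target: {target}")


def ndcg_at_k(ranking, gold_sources, target, k=10):
    """Binary-gain nDCG@k; each gold source is credited at most once."""
    seen = set()
    dcg = 0.0
    for rank, item in enumerate(ranking[:k], start=1):
        if credited(item, gold_sources, target) and item.source_id not in seen:
            seen.add(item.source_id)
            dcg += 1.0 / math.log2(rank + 1)
    # Ideal ranking: one creditable item per gold source at the top ranks.
    ideal_hits = min(len(gold_sources), k)
    idcg = sum(1.0 / math.log2(r + 1) for r in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def rescore(ranking, gold_sources, k=10):
    """Score the same saved ranking under all three targets."""
    return {t: ndcg_at_k(ranking, gold_sources, t, k)
            for t in ("raw", "source", "canonical")}


# Toy saved output: three stored forms of the same gold turn "t1".
saved_ranking = [
    Item("summary-1", "t1"),                   # derived summary
    Item("turn-1", "t1", is_raw=True),         # raw source turn
    Item("fact-1", "t1", is_canonical=True),   # assumed canonical form
]
scores = rescore(saved_ranking, gold_sources={"t1"})
```

On this toy ranking the retrieval output is identical across the three calls, yet the score depends on which stored form is credited: the Source target credits the top-ranked summary, while Raw and Canonical credit items further down the list.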