Long-term conversational memory is a core capability of LLM-based dialogue systems, yet existing benchmarks and evaluation protocols focus primarily on surface-level factual recall. In realistic interactions, an appropriate response often depends on implicit constraints, such as a user's state, goals, or values, that are never explicitly queried later. To evaluate this setting, we introduce \textbf{LoCoMo-Plus}, a benchmark for assessing cognitive memory under cue--trigger semantic disconnect, in which models must retain latent constraints and apply them across long conversational contexts. We further show that conventional string-matching metrics and explicit task-type prompting are misaligned with such scenarios, and we propose a unified evaluation framework based on constraint consistency. Experiments across diverse backbone models, retrieval-based methods, and memory systems show that cognitive memory remains challenging and reveal failure modes not captured by existing benchmarks. Our code and evaluation framework are publicly available at: https://github.com/xjtuleeyf/Locomo-Plus.