Long-term conversational memory is essential for LLM-based assistants, yet existing benchmarks focus on dyadic, single-topic dialogues that fail to capture real-world complexity. We introduce EverMemBench, a benchmark featuring multi-party, multi-group conversations spanning over 1 million tokens with temporally evolving information, cross-topic interleaving, and role-specific personas. EverMemBench evaluates memory systems across three dimensions through 1,000+ QA pairs: fine-grained recall, memory awareness, and user profile understanding. Our evaluation reveals critical limitations: (1) multi-hop reasoning collapses in multi-party settings, with even oracle models achieving only 26% accuracy; (2) temporal reasoning remains unsolved, requiring version semantics beyond timestamp matching; (3) memory awareness is bottlenecked by retrieval, where current similarity-based methods fail to bridge the semantic gap between queries and implicitly relevant memories. EverMemBench provides a challenging testbed for developing next-generation memory architectures.