Evaluating Long-Horizon Memory for Multi-Party Collaborative Dialogues

Long-term conversational memory in practical LLM applications is inherently collaborative: information is produced by multiple participants, scattered across groups and channels, revised over time, and implicitly grounded in roles and social context. Yet there is currently no established benchmark that evaluates memory under interaction patterns resembling real-world deployment, as existing benchmarks largely focus on dyadic or single-topic dialogues. In this paper, we introduce EverMemBench, the first benchmark designed for long-horizon collaborative memory, built from multi-party, multi-group conversations spanning over one million tokens with dense cross-topic interleaving, temporally evolving decisions, and role-conditioned personas. EverMemBench evaluates memory systems using 2400 QA pairs across three dimensions essential for real applications: fine-grained recall, memory awareness, and user profile understanding. Our evaluation reveals fundamental limitations of current systems: multi-hop reasoning collapses under multi-party attribution even with oracle evidence (26% accuracy), temporal reasoning fails without explicit version semantics beyond timestamps, and memory awareness is bottlenecked by retrieval, as similarity-based methods miss implicitly relevant information. EverMemBench thus represents a concrete step toward realistic evaluation of LLM memory and a cornerstone benchmark for developing next-generation LLMs that reason over time, roles, and collaborative interaction structure. Our benchmark and code are publicly available at https://github.com/EverMind-AI/EverMemBench.
