Large Language Models (LLMs) have shown strong potential as conversational agents, yet their effectiveness remains limited by the lack of robust long-term memory, particularly in complex, long-running web-based services such as online emotional support. Existing long-term dialogue benchmarks focus primarily on static, explicit fact retrieval, failing to evaluate agents in critical scenarios where user information is dispersed, implicit, and continuously evolving. To address this gap, we introduce ES-MemEval, a comprehensive benchmark that systematically evaluates five core memory capabilities (information extraction, temporal reasoning, conflict detection, abstention, and user modeling) in long-term emotional support settings, covering question answering, summarization, and dialogue generation tasks. To support the benchmark, we also present EvoEmo, a multi-session dataset for personalized long-term emotional support that captures fragmented, implicit user disclosures and evolving user states. Extensive experiments on open-source long-context, commercial, and retrieval-augmented generation (RAG) LLMs show that explicit long-term memory is essential for reducing hallucinations and enabling effective personalization, while RAG improves factual consistency but struggles with temporal dynamics and evolving user states. These findings highlight both the potential and the limitations of current paradigms and motivate a more robust integration of memory and retrieval in long-term personalized dialogue systems.