Memory-augmented conversational agents enable personalized interactions by drawing on long-term user memory and have gained substantial traction. However, existing benchmarks primarily assess whether agents can recall and apply user information, while overlooking whether such personalization is applied appropriately. In practice, agents may overuse personal information, producing responses that feel forced, intrusive, or socially inappropriate to users. We refer to this issue as \emph{over-personalization}. In this work, we formalize over-personalization into three types: Irrelevance, Repetition, and Sycophancy, and introduce \textbf{OP-Bench}, a benchmark of 1,700 verified instances constructed from long-horizon dialogue histories. Using \textbf{OP-Bench}, we evaluate multiple large language models and memory-augmentation methods, and find that over-personalization is widespread once memory is introduced. Further analysis reveals that agents tend to retrieve and over-attend to user memories even when doing so is unnecessary. To address this issue, we propose \textbf{Self-ReCheck}, a lightweight, model-agnostic memory filtering mechanism that mitigates over-personalization while preserving personalization performance. Our work takes an initial step toward more controllable and appropriate personalization in memory-augmented dialogue systems.