Long-term memory enables large language model (LLM) agents to support personalized and sustained interactions. However, most work on personalized agents prioritizes utility and user experience, treating memory as a neutral component and largely overlooking its safety implications. In this paper, we reveal intent legitimation, a previously underexplored safety failure mode in personalized agents, in which benign personal memories bias intent inference and cause models to legitimize inherently harmful queries. To study this phenomenon, we introduce PS-Bench, a benchmark designed to identify and quantify intent legitimation in personalized interactions. Across multiple memory-augmented agent frameworks and base LLMs, personalization increases attack success rates by 15.8%-243.7% relative to stateless baselines. We further provide mechanistic evidence for intent legitimation in the models' internal representation space, and propose a lightweight detection-reflection method that effectively reduces the resulting safety degradation. Overall, our work provides the first systematic exploration and evaluation of intent legitimation as a safety failure mode that arises naturally from benign, real-world personalization, highlighting the importance of assessing safety under long-term personal context. WARNING: This paper may contain harmful content.