Large Language Models (LLMs) are increasingly engaged in emotionally vulnerable conversations that extend beyond information seeking to moments of personal distress. As they adopt affective tones and simulate empathy, they risk creating the illusion of genuine relational connection. We term this phenomenon Affective Hallucination, referring to emotionally immersive responses that evoke false social presence despite the model's lack of affective capacity. To address this, we introduce AHaBench, a benchmark of 500 mental-health-related prompts with expert-informed reference responses, evaluated along three dimensions: Emotional Enmeshment, Illusion of Presence, and Fostering Overdependence. We further release AHaPairs, a 5K-instance preference dataset enabling Direct Preference Optimization (DPO) for alignment with emotionally responsible behavior. DPO fine-tuning substantially reduces affective hallucination without compromising reasoning performance, and GPT-4o judgments correlate strongly with human judgments (Pearson r = 0.85), confirming AHaBench as an effective diagnostic tool. This work establishes affective hallucination as a distinct safety concern and provides resources for developing LLMs that are both factually reliable and psychologically safe. AHaBench and AHaPairs are available at https://huggingface.co/datasets/o0oMiNGo0o/AHaBench, and code for fine-tuning and evaluation is available at https://github.com/0oOMiNGOo0/AHaBench. Warning: This paper contains examples of mental health-related language that may be emotionally distressing.
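For readers who want a concrete picture of the alignment step, a minimal DPO fine-tuning sketch using the Hugging Face TRL library is shown below. The dataset identifier, base model, preference schema (standard prompt/chosen/rejected columns), and hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of DPO fine-tuning on AHaPairs with Hugging Face TRL
# (assumes a recent TRL release, >= 0.12, where DPOTrainer takes
# `processing_class`). Dataset path, schema, base model, and
# hyperparameters are illustrative assumptions, not the paper's setup.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Hypothetical dataset identifier mirroring the repo naming; assumed to
# expose TRL's standard preference columns: "prompt", "chosen", "rejected".
dataset = load_dataset("o0oMiNGo0o/AHaPairs", split="train")

config = DPOConfig(
    output_dir="ahapairs-dpo",
    beta=0.1,                       # strength of the KL penalty toward the reference model
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,                    # ref_model is omitted; TRL clones one internally
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```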
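Similarly, the reported judge-human agreement can be checked with a Pearson correlation over paired scores, as in the minimal sketch below; the score arrays are placeholders rather than the paper's annotations.

```python
# Sketch of the judge-human agreement check: Pearson correlation between
# GPT-4o judge scores and human scores on the same responses. The arrays
# below are placeholders, not the paper's actual annotation data.
from scipy.stats import pearsonr

gpt4o_scores = [4.0, 2.5, 3.0, 5.0, 1.5]  # hypothetical GPT-4o judge ratings
human_scores = [4.5, 2.0, 3.5, 5.0, 1.0]  # hypothetical human ratings, same items

r, p_value = pearsonr(gpt4o_scores, human_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")  # the paper reports r = 0.85
```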