Extended interaction with large language models (LLMs) has been linked to the reinforcement of delusional beliefs, a phenomenon attracting growing clinical and public concern. Yet most empirical work evaluates model safety in brief interactions, which may not reflect how these harms develop through sustained dialogue. We tested five models across three levels of accumulated context, using the same escalating delusional history at each level to isolate the effect of context on model behaviour. Human raters coded responses on risk and safety dimensions, and each model was analysed qualitatively. Models separated into two distinct tiers: GPT-4o, Grok 4.1 Fast, and Gemini 3 Pro exhibited high-risk, low-safety profiles; Claude Opus 4.5 and GPT-5.2 Instant displayed the opposite pattern. As context accumulated, performance tended to degrade in the unsafe group, while the same material activated stronger safety interventions among the safer models. Qualitative analysis identified distinct mechanisms of failure, including validation of the user's delusional premises, elaboration beyond them, and attempts at harm reduction from within the delusional frame. Safer models, by contrast, often used the established relationship to support intervention, taking accountability for past missteps so that redirection would not be received as betrayal. These findings indicate that accumulated context functions as a stress test of safety architecture, revealing whether a model treats prior dialogue as a worldview to inherit or as evidence to evaluate. Short-context assessments may therefore mischaracterise model safety, underestimating danger in some systems while missing context-activated gains in others. The results suggest that delusional reinforcement by LLMs reflects a preventable alignment failure. In demonstrating that these harms can be resisted, the safer models establish a baseline future systems should now be expected to meet.