"AI Psychosis" in Context: How Conversation History Shapes LLM Responses to Delusional Beliefs

Extended interaction with large language models (LLMs) has been linked to the reinforcement of delusional beliefs, a phenomenon attracting growing clinical and public concern. Yet most empirical work evaluates model safety in brief interactions, which may not reflect how these harms develop through sustained dialogue. We tested five models across three levels of accumulated context, using the same escalating delusional history to isolate its effect on model behaviour. Human raters coded responses on risk and safety dimensions, and each model was analysed qualitatively. Models separated into two distinct tiers: GPT-4o, Grok 4.1 Fast, and Gemini 3 Pro exhibited high-risk, low-safety profiles; Claude Opus 4.5 and GPT-5.2 Instant displayed the opposite pattern. As context accumulated, performance tended to degrade in the unsafe group, while the same material activated stronger safety interventions among the safer models. Qualitative analysis identified distinct mechanisms of failure, including validation of the user's delusional premises, elaboration beyond them, and attempting harm reduction from within the delusional frame. Safer models, however, often used the established relationship to support intervention, taking accountability for past missteps so that redirection would not be received as betrayal. These findings indicate that accumulated context functions as a stress test of safety architecture, revealing whether a model treats prior dialogue as a worldview to inherit or as evidence to evaluate. Short-context assessments may therefore mischaracterise model safety, underestimating danger in some systems while missing context-activated gains in others. The results suggest that delusional reinforcement by LLMs reflects a preventable alignment failure. In demonstrating that these harms can be resisted, the safer models establish a baseline future systems should now be expected to meet.
