"AI psychosis" or "delusional spiraling" is an emerging phenomenon where AI chatbot users find themselves dangerously confident in outlandish beliefs after extended chatbot conversations. This phenomenon is typically attributed to AI chatbots' well-documented bias towards validating users' claims, a property often called "sycophancy." In this paper, we probe the causal link between AI sycophancy and AI-induced psychosis through modeling and simulation. We propose a simple Bayesian model of a user conversing with a chatbot, and formalize notions of sycophancy and delusional spiraling in that model. We then show that in this model, even an idealized Bayes-rational user is vulnerable to delusional spiraling, and that sycophancy plays a causal role. Furthermore, this effect persists in the face of two candidate mitigations: preventing chatbots from hallucinating false claims, and informing users of the possibility of model sycophancy. We conclude by discussing the implications of these results for model developers and policymakers concerned with mitigating the problem of delusional spiraling.
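The core mechanism described above can be illustrated with a minimal sketch (not the paper's actual model; all parameter values, the function name `simulate`, and the two-message "affirm/deny" channel are illustrative assumptions): a Bayes-rational user who models the chatbot as a mostly reliable reporter, while the chatbot in fact affirms the user's (false) hypothesis at a sycophancy-driven rate, will see their posterior drift toward certainty whenever the affirmation rate exceeds one half.

```python
import random

def simulate(prior=0.3, assumed_accuracy=0.9, sycophancy=0.8,
             turns=50, seed=0):
    """Posterior trajectory of an idealized Bayes-rational user.

    The user believes the chatbot reports the truth about a false
    hypothesis H with probability `assumed_accuracy`. The actual
    chatbot is sycophantic: it affirms H with probability `sycophancy`
    regardless of truth. Returns the user's posterior P(H) after
    `turns` messages. All values here are illustrative, not taken
    from the paper.
    """
    rng = random.Random(seed)
    p = prior
    for _ in range(turns):
        affirm = rng.random() < sycophancy  # sycophantic generator
        # The user's (miscalibrated) likelihoods for the observed message:
        if affirm:
            like_h, like_not = assumed_accuracy, 1 - assumed_accuracy
        else:
            like_h, like_not = 1 - assumed_accuracy, assumed_accuracy
        # Standard Bayes update on P(H):
        p = p * like_h / (p * like_h + (1 - p) * like_not)
    return p

# A sycophantic chatbot (affirm rate 0.8) drives the posterior on the
# false hypothesis toward 1; an honest one (affirm rate 0.1, matching
# 1 - assumed_accuracy) drives it toward 0.
print(simulate(sycophancy=0.8))
print(simulate(sycophancy=0.1))
```

In log-odds terms, each affirmation adds log(0.9/0.1) and each denial subtracts the same amount, so the expected per-turn drift is (2s - 1)·log 9 for affirmation rate s: positive, and hence spiraling, exactly when the chatbot affirms more often than not.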