Millions of users turn to consumer AI chatbots to discuss mental health and behavioral concerns. While this presents unprecedented opportunities to deliver population-level support, it also highlights an urgent need for rigorous and scalable safety evaluations. Here we introduce SIM-VAIL, an AI chatbot auditing framework that captures how harmful chatbot responses manifest across a range of mental health contexts. SIM-VAIL pairs a simulated user, harboring a distinct psychiatric vulnerability and conversational intent, with a frontier AI chatbot. It scores conversation turns on 13 clinically relevant risk dimensions, enabling context-dependent, temporally resolved safety assessment. Across 810 conversations, encompassing over 90,000 turn-level ratings and 30 psychiatric user profiles, we found evidence of concerning chatbot behavior across virtually all user phenotypes and most of the 9 consumer AI chatbots audited, albeit reduced in newer models. Rather than arising abruptly, concerning behavior accumulated over multiple turns. Risk profiles were phenotype-dependent and exhibited trade-offs, indicating that chatbot behaviors that appear supportive in general settings can become maladaptive when they align with mechanisms that sustain a user's vulnerability. These findings identify a systematic failure mode in human-AI interactions, which we term Vulnerability-Amplifying Interaction Loops (VAILs), and underscore the need for multidimensional approaches to risk quantification. SIM-VAIL provides a scalable framework for quantifying how mental health risk is distributed across user phenotypes, conversational trajectories, and clinically grounded behavioral dimensions, offering a new foundation for targeted safety improvements.
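The audit pipeline described above can be sketched as follows. This is a minimal illustrative sketch only, under assumed interfaces: every name (`SimulatedUser`, `audit_conversation`), the stub chatbot and rater, and the example dimension labels are hypothetical, not SIM-VAIL's actual API or its 13 clinical risk dimensions.

```python
# Minimal sketch of a SIM-VAIL-style audit loop (illustrative assumptions only;
# names, stubs, and dimension labels are not the paper's actual implementation).
from dataclasses import dataclass

@dataclass
class SimulatedUser:
    """A user phenotype: a psychiatric vulnerability plus a conversational intent."""
    vulnerability: str
    intent: str

    def next_message(self, history):
        # Placeholder: a real simulator would condition an LLM on this profile.
        return f"[{self.vulnerability} / {self.intent}] turn {len(history) // 2 + 1}"

def audit_conversation(user, chatbot_reply, n_turns, rate_turn, dimensions):
    """Run n_turns exchanges, rating each chatbot reply on every risk dimension."""
    history, ratings = [], []
    for _ in range(n_turns):
        msg = user.next_message(history)
        history.append(("user", msg))
        reply = chatbot_reply(history)
        history.append(("assistant", reply))
        # Turn-level, multidimensional scoring enables temporally resolved analysis.
        ratings.append({dim: rate_turn(reply, dim) for dim in dimensions})
    return history, ratings

# Toy run with stub components (3 invented dimension labels stand in for the 13).
DIMENSIONS = ["validation_of_harm", "crisis_handling", "boundary_setting"]
user = SimulatedUser(vulnerability="health anxiety", intent="seek reassurance")
stub_chatbot = lambda history: "stub reply"
stub_rater = lambda reply, dim: 0  # a real rater would apply a clinical rubric
history, ratings = audit_conversation(user, stub_chatbot, n_turns=5,
                                      rate_turn=stub_rater, dimensions=DIMENSIONS)
print(len(ratings), len(ratings[0]))
```

Structuring the output as one rating dictionary per turn is what makes the paper's central observation expressible: risk that accumulates gradually across turns shows up as a trend in the per-dimension rating series rather than as a single flagged response.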