Millions of users turn to consumer AI chatbots to discuss behavioral and mental health concerns. While this presents unprecedented opportunities to deliver population-level support, it also highlights an urgent need for rigorous and scalable safety evaluations. Here we introduce SIM-VAIL, an AI chatbot auditing framework that captures how harmful AI chatbot responses manifest across a range of mental-health contexts. SIM-VAIL pairs a simulated human user, harboring a distinct psychiatric vulnerability and conversational intent, with an audited frontier AI chatbot. It scores conversation turns on 13 clinically relevant risk dimensions, enabling context-dependent, temporally resolved assessment of mental-health risk. Across 810 conversations, encompassing over 90,000 turn-level ratings and 30 psychiatric user profiles, we find that significant risk occurs across virtually all user phenotypes. Risk manifested in most of the 9 consumer AI chatbot models audited, albeit mitigated in more recent variants. Rather than arising abruptly, risk accumulated over multiple turns. Risk profiles were phenotype-dependent, indicating that behaviors that appear supportive in general settings can become maladaptive when they align with the mechanisms that sustain a user's vulnerability. Multivariate risk patterns revealed trade-offs across dimensions, suggesting that mitigations targeting one harm domain can exacerbate others. These findings identify a novel failure mode in human-AI interactions, which we term Vulnerability-Amplifying Interaction Loops (VAILs), and underscore the need for multi-dimensional approaches to risk quantification. SIM-VAIL provides a scalable evaluation framework for quantifying how mental-health risk is distributed across user phenotypes, conversational trajectories, and clinically grounded behavioral dimensions, offering a foundation for targeted safety improvements.
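The audit loop described above (a simulated user with a fixed vulnerability profile and intent, paired with an audited chatbot, with every turn scored on 13 risk dimensions) can be sketched as follows. This is a minimal, hypothetical illustration: the names `UserProfile`, `TurnRating`, and `audit_conversation`, the placeholder dimension labels, and all interfaces are assumptions for exposition, not the paper's actual implementation.

```python
# Hypothetical sketch of a SIM-VAIL-style audit loop; all names and
# interfaces are illustrative, not taken from the paper.
from dataclasses import dataclass
from typing import Callable, Dict, List

# Placeholder labels standing in for the 13 clinically grounded risk dimensions.
RISK_DIMENSIONS = [f"dimension_{i}" for i in range(13)]

@dataclass
class UserProfile:
    phenotype: str  # simulated psychiatric vulnerability
    intent: str     # conversational intent

@dataclass
class TurnRating:
    turn: int
    scores: Dict[str, float]  # one score per risk dimension

def audit_conversation(
    profile: UserProfile,
    simulate_user: Callable[[UserProfile, List[str]], str],
    chatbot: Callable[[List[str]], str],
    judge: Callable[[str, UserProfile], Dict[str, float]],
    n_turns: int = 5,
) -> List[TurnRating]:
    """Pair a simulated user with an audited chatbot and score every turn."""
    history: List[str] = []
    ratings: List[TurnRating] = []
    for t in range(n_turns):
        history.append(simulate_user(profile, history))  # simulated user speaks
        reply = chatbot(history)                          # audited model replies
        history.append(reply)
        ratings.append(TurnRating(t, judge(reply, profile)))  # turn-level rating
    return ratings
```

In the full framework, `simulate_user`, `chatbot`, and `judge` would each be backed by a language model; here they are injectable callables so the turn-level rating loop itself is visible and testable, and so that risk can be inspected per turn rather than per conversation.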