Millions of users turn to consumer AI chatbots to discuss behavioral and mental health concerns. While this presents unprecedented opportunities to deliver population-level support, it also highlights an urgent need for rigorous and scalable safety evaluations. Here we introduce SIM-VAIL, an AI chatbot auditing framework that captures how harmful AI chatbot responses manifest across a range of mental-health contexts. SIM-VAIL pairs a simulated human user, harboring a distinct psychiatric vulnerability and conversational intent, with an audited frontier AI chatbot. It scores conversation turns on 13 clinically relevant risk dimensions, enabling context-dependent, temporally resolved assessment of mental-health risk. Across 810 conversations, encompassing over 90,000 turn-level ratings and 30 psychiatric user profiles, we find that significant risk occurs across virtually all user phenotypes. Risk manifested in most of the 9 consumer AI chatbot models audited, albeit mitigated in more recent variants. Rather than arising abruptly, risk accumulated over multiple turns. Risk profiles were phenotype-dependent, indicating that behaviors that appear supportive in general settings can become maladaptive when they align with the mechanisms that sustain a user's vulnerability. Multivariate risk patterns revealed trade-offs across dimensions, suggesting that mitigations targeting one harm domain can exacerbate others. These findings identify a novel failure mode in human-AI interactions, which we term Vulnerability-Amplifying Interaction Loops (VAILs), and underscore the need for multi-dimensional approaches to risk quantification. SIM-VAIL provides a scalable evaluation framework for quantifying how mental-health risk is distributed across user phenotypes, conversational trajectories, and clinically grounded behavioral dimensions, offering a foundation for targeted safety improvements.
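The audit loop described above (a simulated user with a fixed vulnerability profile and intent, paired with an audited chatbot, with every turn scored on 13 risk dimensions) can be sketched as follows. This is a minimal, hypothetical illustration: the names `UserProfile`, `TurnRating`, and `audit_conversation`, the placeholder dimension labels, and all interfaces are assumptions for exposition, not the paper's actual implementation.

```python
# Hypothetical sketch of a SIM-VAIL-style audit loop; all names and
# interfaces are illustrative, not taken from the paper.
from dataclasses import dataclass
from typing import Callable, Dict, List

# Placeholder labels standing in for the 13 clinically grounded risk dimensions.
RISK_DIMENSIONS = [f"dimension_{i}" for i in range(13)]

@dataclass
class UserProfile:
    phenotype: str  # simulated psychiatric vulnerability
    intent: str     # conversational intent

@dataclass
class TurnRating:
    turn: int
    scores: Dict[str, float]  # one score per risk dimension

def audit_conversation(
    profile: UserProfile,
    simulate_user: Callable[[UserProfile, List[str]], str],
    chatbot: Callable[[List[str]], str],
    judge: Callable[[str, UserProfile], Dict[str, float]],
    n_turns: int = 5,
) -> List[TurnRating]:
    """Pair a simulated user with an audited chatbot and score every turn."""
    history: List[str] = []
    ratings: List[TurnRating] = []
    for t in range(n_turns):
        history.append(simulate_user(profile, history))  # simulated user speaks
        reply = chatbot(history)                          # audited model replies
        history.append(reply)
        ratings.append(TurnRating(t, judge(reply, profile)))  # turn-level rating
    return ratings
```

In the full framework, `simulate_user`, `chatbot`, and `judge` would each be backed by a language model; here they are injectable callables so the turn-level rating loop itself is visible and testable, and so that risk can be inspected per turn rather than per conversation.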