Large language model-powered chatbots have transformed how people seek information, especially in high-stakes contexts like mental health. Despite their support capabilities, safe detection and response to crises such as suicidal ideation and self-harm are still unclear, hindered by the lack of unified crisis taxonomies and clinical evaluation standards. We address this by creating: (1) a taxonomy of six crisis categories; (2) a dataset of over 2,000 inputs from 12 mental health datasets, classified into these categories; and (3) a clinical response assessment protocol. We also use LLMs to identify crisis inputs and audit five models for response safety and appropriateness. First, we built a clinical-informed crisis taxonomy and evaluation protocol. Next, we curated 2,252 relevant examples from over 239,000 user inputs, then tested three LLMs for automatic classification. In addition, we evaluated five models for the appropriateness of their responses to a user's crisis, graded on a 5-point Likert scale from harmful (1) to appropriate (5). While some models respond reliably to explicit crises, risks still exist. Many outputs, especially in self-harm and suicidal categories, are inappropriate or unsafe. Different models perform variably; some, like gpt-5-nano and deepseek-v3.2-exp, have low harm rates, but others, such as gpt-4o-mini and grok-4-fast, generate more unsafe responses. All models struggle with indirect signals, default replies, and context misalignment. These results highlight the urgent need for better safeguards, crisis detection, and context-aware responses in LLMs. They also show that alignment and safety practices, beyond scale, are crucial for reliable crisis support. Our taxonomy, datasets, and evaluation methods support ongoing AI mental health research, aiming to reduce harm and protect vulnerable users.
翻译:大语言模型驱动的聊天机器人已改变了人们获取信息的方式,尤其是在心理健康等高风险情境中。尽管它们具备支持能力,但对于自杀意念、自残等危机的安全检测与回应机制仍不明确,这主要受限于缺乏统一的危机分类体系和临床评估标准。为此,我们构建了以下资源:(1) 涵盖六类危机的分类体系;(2) 基于12个心理健康数据集、包含2000余条输入的数据集,并按上述分类进行标注;(3) 一套临床回应评估协议。我们还利用大语言模型识别危机输入,并审计了五个模型在回应安全性与适宜性方面的表现。首先,我们建立了基于临床知识的危机分类体系与评估协议。随后,从超过239,000条用户输入中筛选出2,252条相关样本,并测试了三个大语言模型的自动分类能力。此外,我们评估了五个模型对用户危机回应的适宜性,采用5点李克特量表评分(1代表有害,5代表适宜)。研究发现,部分模型能对显性危机作出可靠回应,但风险依然存在。许多输出结果——尤其在自残与自杀类别中——存在不适宜或不安全的问题。不同模型表现差异显著:例如gpt-5-nano和deepseek-v3.2-exp的伤害率较低,而gpt-4o-mini和grok-4-fast则生成更多不安全回应。所有模型在处理间接信号、默认回复及语境错位时均存在困难。这些结果凸显了大语言模型亟需改进安全防护机制、危机检测能力及语境感知回应策略。研究同时表明,对齐技术与安全实践(而非仅依赖模型规模)对实现可靠的危机支持至关重要。我们构建的分类体系、数据集与评估方法将持续支持人工智能心理健康领域的研究,旨在减少伤害并保护脆弱用户群体。