Between Help and Harm: An Evaluation of Mental Health Crisis Handling by LLMs

Large language model-powered chatbots have transformed how people seek information, especially in high-stakes contexts like mental health. Despite their capacity to offer support, it remains unclear whether they can safely detect and respond to crises such as suicidal ideation and self-harm, and progress is hindered by the lack of unified crisis taxonomies and clinical evaluation standards. We address this gap with three contributions: (1) a taxonomy of six crisis categories; (2) a dataset of over 2,000 inputs drawn from 12 mental health datasets and classified into these categories; and (3) a clinical response assessment protocol. We also use LLMs to identify crisis inputs and audit five models for response safety and appropriateness. First, we built a clinically informed crisis taxonomy and evaluation protocol. Next, we curated 2,252 relevant examples from over 239,000 user inputs and tested three LLMs on automatic classification. Finally, we evaluated five models on the appropriateness of their responses to a user's crisis, graded on a 5-point Likert scale from harmful (1) to appropriate (5). While some models respond reliably to explicit crises, risks remain: many outputs, especially in the self-harm and suicidal ideation categories, are inappropriate or unsafe. Performance varies across models: some, such as gpt-5-nano and deepseek-v3.2-exp, have low harm rates, while others, such as gpt-4o-mini and grok-4-fast, generate more unsafe responses. All models struggle with indirect signals, default replies, and context misalignment. These results highlight the urgent need for better safeguards, crisis detection, and context-aware responses in LLMs. They also show that alignment and safety practices, not scale alone, are crucial for reliable crisis support. Our taxonomy, datasets, and evaluation methods support ongoing research on AI for mental health, with the aim of reducing harm and protecting vulnerable users.
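As an illustration of how per-response Likert grades could be turned into the per-model harm rates mentioned above, the following minimal Python sketch aggregates (model, score) pairs. It is not the protocol used in this work: the function name `harm_rates`, the cutoff `HARM_THRESHOLD = 2`, the placeholder model names, and the ratings are all assumptions made for the example.

```python
from collections import defaultdict

# Assumed cutoff (not taken from the paper): scores at or below this value
# on the 1 (harmful) to 5 (appropriate) scale are counted as harmful.
HARM_THRESHOLD = 2


def harm_rates(ratings, threshold=HARM_THRESHOLD):
    """Return {model: fraction of its responses scored <= threshold}.

    `ratings` is an iterable of (model_name, likert_score) pairs.
    """
    totals = defaultdict(int)
    harmful = defaultdict(int)
    for model, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError(f"Likert score out of range: {score}")
        totals[model] += 1
        if score <= threshold:
            harmful[model] += 1
    return {model: harmful[model] / totals[model] for model in totals}


# Made-up ratings for two placeholder models, for illustration only.
example_ratings = [
    ("model_a", 5), ("model_a", 4), ("model_a", 1), ("model_a", 5),
    ("model_b", 2), ("model_b", 1), ("model_b", 3), ("model_b", 5),
]

rates = harm_rates(example_ratings)
print(rates)
```

Under this assumed cutoff, a model's harm rate is the share of its graded responses that fall in the lowest two points of the scale. A stricter or looser threshold would change the resulting ranking, so the cutoff should be reported alongside any such rate.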
