Millions now use generative AI chatbots for psychological support. Despite their promise of availability and scale, the single most pressing question in AI for mental health is whether these tools are safe. The Validation of Ethical and Responsible AI in Mental Health (VERA-MH) evaluation was recently proposed to meet the urgent need for an evidence-based, automated safety benchmark. This study aimed to examine the clinical validity and reliability of VERA-MH for evaluating AI safety in suicide risk detection and response. We first simulated a large set of conversations between large language model (LLM)-based users (user-agents) and general-purpose AI chatbots. Licensed mental health clinicians used a rubric (scoring guide) to independently rate the simulated conversations for safe and unsafe chatbot behaviors, as well as user-agent realism. An LLM-based judge used the same scoring rubric to evaluate the same set of simulated conversations. We then (a) examined rating alignment among individual clinicians, (b) examined alignment between clinician consensus and the LLM judge, and (c) summarized clinicians' ratings of user-agent realism. Individual clinicians were generally consistent with one another in their safety ratings (chance-corrected inter-rater reliability [IRR] = 0.77), establishing a gold-standard clinical reference. The LLM judge was strongly aligned with this clinical consensus overall (IRR = 0.81) and within key conditions. Together, findings from this human evaluation study support the validity and reliability of VERA-MH: an open-source, automated AI safety evaluation for mental health. Future research will examine the generalizability and robustness of VERA-MH and expand the framework to target additional key areas of AI safety in mental health.