The escalating global mental health crisis, marked by persistent treatment gaps, limited access to care, and a shortage of qualified therapists, positions Large Language Models (LLMs) as a promising avenue for scalable support. While LLMs offer potential for accessible emotional assistance, their reliability, therapeutic relevance, and alignment with human standards remain open challenges. This paper introduces a human-grounded evaluation methodology for assessing LLM-generated responses in therapeutic dialogue. Our approach involved curating 500 mental health conversations drawn from datasets of real-world scenario questions and evaluating the responses generated by nine diverse LLMs, spanning both closed-source and open-source models. Two psychiatrically trained experts independently rated each response on a 5-point Likert scale across a comprehensive six-attribute rubric covering Cognitive Support and Affective Resonance, providing a multidimensional perspective on therapeutic quality. Our analysis reveals that LLMs provide strong cognitive reliability, producing safe, coherent, and clinically appropriate information, but demonstrate unstable affective alignment. While closed-source models (e.g., GPT-4o) offer balanced therapeutic responses, open-source models show greater variability and emotional flatness. We identify a persistent cognitive-affective gap and highlight the need for failure-aware, clinically grounded evaluation frameworks that prioritize relational sensitivity alongside informational accuracy in mental-health-oriented LLMs. We advocate for balanced, human-in-the-loop evaluation protocols that center on therapeutic sensitivity, and we provide a framework to guide the responsible design and clinical oversight of mental-health-oriented conversational AI.