Background: Rapid advances in natural language processing have produced large language models with the potential to transform mental health care. These models have shown promise in assisting clinicians and in supporting individuals experiencing a range of psychological challenges.

Objective: This study aimed to compare the performance of two large language models, GPT-4 and ChatGPT, in responding to a set of 18 psychological prompts, to assess their potential applicability in mental health care settings.

Methods: A blind methodology was employed: a clinical psychologist rated the models' responses without knowing which model produced each one. The prompts covered a diverse range of mental health topics, including depression, anxiety, and trauma, to ensure a comprehensive assessment.

Results: The results demonstrated a significant difference in performance between the two models (p < 0.05). GPT-4 achieved an average rating of 8.29 out of 10, whereas ChatGPT received an average rating of 6.52. The clinical psychologist's evaluation indicated that GPT-4 was more effective at generating clinically relevant and empathetic responses, thereby providing better support and guidance to potential users.

Conclusions: This study contributes to the growing body of literature on the applicability of large language models in mental health care settings. The findings underscore the need for continued research and development to optimize these models for clinical use. Further investigation is needed to identify the specific factors underlying the performance differences between the two models and to assess their generalizability across different populations and mental health conditions.