Towards Interpretable Mental Health Analysis with ChatGPT

Automated mental health analysis shows great potential for enhancing the efficiency and accessibility of mental health care, with recent methods using pre-trained language models (PLMs) and incorporating emotional information. The latest large language models (LLMs), such as ChatGPT, exhibit remarkable capabilities on diverse natural language processing tasks. However, existing studies on ChatGPT for mental health analysis are limited by inadequate evaluation, neglect of emotional information, and a lack of explainability. To bridge these gaps, we comprehensively evaluate the mental health analysis and emotional reasoning ability of ChatGPT on 11 datasets across 5 tasks, and analyze the effects of various emotion-based prompting strategies. Based on these prompts, we further explore LLMs for interpretable mental health analysis by instructing them to also generate explanations for each of their decisions. With an annotation protocol designed by domain experts, we conduct human evaluations to assess the quality of explanations generated by ChatGPT and GPT-3. The annotated corpus will be released for future research. Experimental results show that ChatGPT outperforms traditional neural network-based methods but still lags significantly behind advanced task-specific methods. Prompt engineering with emotional cues can improve performance on mental health analysis, but it suffers from a lack of robustness and from inaccurate reasoning. In addition, ChatGPT significantly outperforms GPT-3 on all criteria in human evaluations of the explanations and approaches human performance, showing its great potential in explainable mental health analysis.
