Advances in large language models (LLMs) have empowered a variety of applications. However, there is still a significant research gap in understanding and enhancing the capabilities of LLMs in the field of mental health. In this work, we present the first comprehensive evaluation of multiple LLMs, including Alpaca, Alpaca-LoRA, FLAN-T5, GPT-3.5, and GPT-4, on various mental health prediction tasks using online text data. We conduct a broad range of experiments covering zero-shot prompting, few-shot prompting, and instruction fine-tuning. The results indicate promising yet limited performance of LLMs with zero-shot and few-shot prompt designs on mental health tasks. More importantly, our experiments show that instruction fine-tuning can significantly boost the performance of LLMs on all tasks simultaneously. Our best fine-tuned models, Mental-Alpaca and Mental-FLAN-T5, outperform the best prompt design of GPT-3.5 (25 and 15 times bigger, respectively) by 10.9% on balanced accuracy and the best of GPT-4 (250 and 150 times bigger, respectively) by 4.8%. They also perform on par with the state-of-the-art task-specific language model. We further conduct an exploratory case study of LLMs' capability on mental health reasoning tasks, illustrating the promising capability of certain models such as GPT-4. We summarize our findings into a set of action guidelines for potential methods to enhance LLMs' capability for mental health tasks. We also emphasize the important limitations that must be addressed before these models can be deployed in real-world mental health settings, such as known racial and gender bias, and we highlight the important ethical risks accompanying this line of research.