Mental-LLM: Leveraging Large Language Models for Mental Health Prediction via Online Text Data

Advances in large language models (LLMs) have empowered a variety of applications. However, research on understanding and enhancing the capabilities of LLMs in mental health remains limited. In this work, we present a comprehensive evaluation of multiple LLMs, including Alpaca, Alpaca-LoRA, FLAN-T5, GPT-3.5, and GPT-4, on various mental health prediction tasks via online text data. We conduct a broad range of experiments covering zero-shot prompting, few-shot prompting, and instruction fine-tuning. The results indicate promising yet limited performance of LLMs with zero-shot and few-shot prompt designs for mental health tasks. More importantly, our experiments show that instruction fine-tuning can significantly boost the performance of LLMs for all tasks simultaneously. Our best fine-tuned models, Mental-Alpaca and Mental-FLAN-T5, outperform the best prompt design of GPT-3.5 (25 and 15 times bigger, respectively) by 10.9% on balanced accuracy and the best of GPT-4 (250 and 150 times bigger, respectively) by 4.8%. They further perform on par with the state-of-the-art task-specific language model. We also conduct an exploratory case study on LLMs' capability on mental health reasoning tasks, illustrating the promising capability of certain models such as GPT-4. We summarize our findings into a set of action guidelines for potential methods to enhance LLMs' capability for mental health tasks. Meanwhile, we emphasize the important limitations that must be addressed before these models are deployable in real-world mental health settings, such as known racial and gender bias, and we highlight the important ethical risks accompanying this line of research.
