Recent advances in large language models (LLMs) have empowered a variety of applications. However, there is very little research on understanding and improving LLMs' capability in the mental health domain. In this work, we present the first comprehensive evaluation of multiple LLMs, including Alpaca, Alpaca-LoRA, and GPT-3.5, on various mental health prediction tasks using online text data. We conduct a wide range of experiments covering zero-shot prompting, few-shot prompting, and instruction finetuning. The results indicate promising yet limited performance of LLMs with zero-shot and few-shot prompt designs on mental health tasks. More importantly, our experiments show that instruction finetuning can significantly boost the performance of LLMs on all tasks simultaneously. Our best finetuned model, Mental-Alpaca, outperforms GPT-3.5 (which is 25 times larger) by 16.7% on balanced accuracy and performs on par with the state-of-the-art task-specific model. We summarize our findings into a set of action guidelines for future researchers, engineers, and practitioners on how to equip LLMs with better mental health domain knowledge so that they become experts in mental health prediction tasks.