Cognitive Behavioral Therapy (CBT) is a well-established, evidence-based treatment for Major Depressive Disorder. Unfortunately, significant barriers prevent individuals from accessing CBT, including cost, a scarcity of therapists, and stigma. This study explores the feasibility of fine-tuning small open-weight large language models (LLMs) to deliver CBT for depression. Using 58 sets of synthetic CBT transcripts generated by the Nous Research fine-tune of Llama 3.1 405b, we fine-tuned three models: Mistral 7b v0.3, Qwen 2.5 7b, and Llama 3.1 8b. CBT fidelity was evaluated with a modified Cognitive Therapy Rating Scale (CTRS). All fine-tuned models were compared against each other and against their instruct-tuned variants. To evaluate model performance, simulated patient transcripts were generated, with the instruct and CBT-tuned models acting as the therapist and DeepSeek-V2.5 acting as the patient. These simulated transcripts were scored on the modified CTRS by Gemini 1.5 Pro-002. Our findings demonstrated that the CBT-tuned models significantly outperformed their instruct-tuned counterparts, with an average improvement of 11.33 points (p < 0.001) in total CTRS score. Llama 3.1 8b showed the strongest performance (mean CTRS score 67.86 ± 7.24), followed by Qwen 2.5 7b (64.28 ± 9.55) and Mistral 7b v0.3 (64.17 ± 9.79); these between-model differences were statistically significant. The CBT-tuned models were competent in implementing core CBT techniques and providing empathetic responses; however, limitations were observed in agenda adherence, depth of exploration, and long-context coherence. This study establishes that CBT-specific fine-tuning can effectively encode therapeutic competencies in small LLMs, though significant technical and ethical considerations must be resolved prior to clinical deployment.