The deployment of Large Language Models (LLMs) in mental health counseling faces the dual challenges of hallucination and lack of empathy. While retrieval-augmented generation (RAG) can mitigate the former by anchoring answers in trusted clinical sources, it remains an open question whether the most effective model under this paradigm is one fine-tuned on mental health data or a more general, more capable model that succeeds purely through reasoning. In this paper, we perform a direct comparison by running four open-source models through the same ChromaDB-based RAG pipeline: two generalist reasoners (Qwen2.5-3B and Phi-3-Mini) and two domain-specific fine-tunes (MentalHealthBot-7B and TherapyBot-7B). We automate evaluation over 50 conversational turns with an LLM-as-a-Judge framework. We find a clear trend: the generalist models outperform the domain-specific ones on empathy (3.72 vs. 3.26, $p < 0.001$) despite being much smaller (3B vs. 7B parameters); all models score well on safety, but the generalists show stronger contextual understanding and none of the overfitting artifacts we observe in the domain-specific models. Overall, our results indicate that for RAG-based therapy systems, strong reasoning matters more than exposure to mental health-specific vocabulary: provided the answer is already grounded in clinical evidence, a general model with strong reasoning delivers more empathetic and balanced support than a larger, narrowly fine-tuned model.
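To make the shared pipeline concrete, the following is a minimal sketch of the retrieval-and-grounding step, assuming a ChromaDB collection of trusted clinical passages as described above; the collection name, storage path, placeholder documents, prompt template, and helper `build_grounded_prompt` are illustrative assumptions, not details taken from the paper.

```python
# Sketch of the RAG grounding step (assumed, not the paper's exact code).
import chromadb

# Hypothetical on-disk store for the clinical corpus.
client = chromadb.PersistentClient(path="./clinical_db")
collection = client.get_or_create_collection(name="clinical_sources")

# Index trusted clinical passages (placeholder documents for illustration).
collection.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "Grounding exercises such as paced breathing can reduce acute anxiety.",
        "Behavioral activation is a first-line technique for low mood.",
    ],
)

def build_grounded_prompt(user_turn: str, k: int = 2) -> str:
    """Retrieve the top-k passages and prepend them to the user's turn,
    so every candidate model answers from the same clinical context."""
    results = collection.query(query_texts=[user_turn], n_results=k)
    context = "\n".join(results["documents"][0])
    return (
        "Answer empathetically, using only the clinical context below.\n"
        f"Context:\n{context}\n\nUser: {user_turn}\nCounselor:"
    )

print(build_grounded_prompt("I feel anxious all the time."))
```

Because all four models receive the identical grounded prompt, any difference in empathy or contextual understanding can be attributed to the model rather than to retrieval quality.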
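The evaluation side can be sketched similarly. The 1-5 dimensions (empathy, safety, contextual understanding) follow the results reported above, but the rubric wording, the reply format, and the parsing helper below are assumptions about how an LLM-as-a-Judge pass might be structured, not the paper's verbatim protocol.

```python
# Illustrative LLM-as-a-Judge scoring step (rubric text is assumed).
import re

JUDGE_TEMPLATE = """You are evaluating a counseling response.
Rate it on a 1-5 scale for each dimension and reply exactly as:
empathy: <score>
safety: <score>
context: <score>

User turn: {user_turn}
Retrieved clinical context: {context}
Model response: {response}"""

def parse_judge_scores(judge_output: str) -> dict[str, int]:
    """Extract the three integer scores from a judge model's reply."""
    scores = {}
    for dim in ("empathy", "safety", "context"):
        match = re.search(rf"{dim}:\s*([1-5])", judge_output)
        if match:
            scores[dim] = int(match.group(1))
    return scores

# Example: parsing one judge reply out of the 50 evaluated turns.
print(parse_judge_scores("empathy: 4\nsafety: 5\ncontext: 3"))
# -> {'empathy': 4, 'safety': 5, 'context': 3}
```

Averaging such per-turn scores across the 50 turns yields per-model means like the 3.72 vs. 3.26 empathy gap reported in the abstract.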