Reasoning Over Recall: Evaluating the Efficacy of Generalist Architectures vs. Specialized Fine-Tunes in RAG-Based Mental Health Dialogue Systems

Deploying Large Language Models (LLMs) in mental health counseling faces two challenges: hallucination and a lack of empathy. Retrieval-augmented generation (RAG) can mitigate the former by anchoring answers in trusted clinical sources, but it remains an open question whether the most effective model under this paradigm is one fine-tuned on mental health data or a more capable general-purpose model that relies on reasoning alone. In this paper, we compare the two directly by running four open-source models through the same ChromaDB-backed RAG pipeline: two generalist reasoners (Qwen2.5-3B and Phi-3-Mini) and two domain-specific fine-tunes (MentalHealthBot-7B and TherapyBot-7B). We automate evaluation over 50 turns with an LLM-as-a-Judge framework. We find a clear trend: the generalist models outperform the domain-specific ones in empathy (3.72 vs. 3.26, $p < 0.001$) despite being much smaller (3B vs. 7B). All models perform well on safety, but the generalist models show better contextual understanding and are less prone to the overfitting we observe in the domain-specific models. Overall, our results indicate that for RAG-based therapy systems, strong reasoning matters more than training on mental health-specific vocabulary: a general model that reasons well provides more empathetic and balanced support than a larger, narrowly fine-tuned model, so long as its answers are already grounded in clinical evidence.
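The controlled comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the word-overlap retriever is a toy stand-in for the ChromaDB retrieval step, and `DOCS`, `retrieve`, `build_prompt`, and `group_means` are hypothetical names introduced here only for illustration. The point it shows is that every model receives the same grounded prompt, so differences in judge scores can be attributed to the model rather than to retrieval.

```python
from statistics import mean

# Hypothetical stand-in for the clinical knowledge base (the paper uses ChromaDB).
DOCS = [
    "Grounding exercises can help reduce acute anxiety symptoms.",
    "Sleep hygiene includes keeping a consistent bedtime routine.",
    "Behavioral activation schedules small rewarding activities for low mood.",
]

# Model groups as named in the abstract.
GROUPS = {
    "generalist": ["Qwen2.5-3B", "Phi-3-Mini"],
    "specialist": ["MentalHealthBot-7B", "TherapyBot-7B"],
}


def tokenize(text):
    """Lowercase a text and split it into a set of punctuation-stripped words."""
    return {w.strip(".,?!").lower() for w in text.split()}


def retrieve(query, docs, k=1):
    """Rank documents by word overlap with the query (toy retriever, no embeddings)."""
    q = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, docs, k=1):
    """Build the single grounded prompt that every model under test would receive."""
    context = "\n".join(retrieve(query, docs, k))
    return f"Context:\n{context}\n\nUser: {query}\nCounselor:"


def group_means(scores, groups):
    """Average per-turn judge scores over all models in each group."""
    return {
        name: mean(s for model in members for s in scores[model])
        for name, members in groups.items()
    }


if __name__ == "__main__":
    # The same prompt would be sent to all four models.
    print(build_prompt("I feel anxiety before exams", DOCS))
```

In a fuller harness, `scores` would map each model name to the per-turn ratings returned by the judge model, and `group_means(scores, GROUPS)` would yield the kind of group-level empathy averages the abstract reports. The significance test behind the reported $p$-value is not specified in the abstract and is left out of this sketch.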

