Large language models (LLMs) hold transformative potential for medical decision support, yet their application in psychiatry remains constrained by hallucinations and superficial reasoning. This limitation is particularly acute in light-parameter LLMs, which are essential for privacy-preserving and efficient clinical deployment. Existing training paradigms prioritize linguistic fluency over structured clinical logic, resulting in a fundamental misalignment with professional diagnostic cognition. Here we introduce ClinMPO, a reinforcement learning framework designed to align the internal reasoning of LLMs with professional psychiatric practice. The framework employs a specialized reward model trained independently on a dataset derived from 4,474 psychiatry journal articles and structured according to evidence-based medicine principles. We evaluated ClinMPO on an unseen benchmark subset designed to isolate reasoning capability from rote memorization; this test set comprises items on which leading large-parameter LLMs consistently fail. We compared the performance of the ClinMPO-aligned light LLM against that of a cohort of 300 medical students. The ClinMPO-tuned Qwen3-8B model achieved a diagnostic accuracy of 31.4%, surpassing the human benchmark of 30.8% on these complex cases. These results demonstrate that medical evidence-guided optimization enables light-parameter LLMs to master complex reasoning tasks. Our findings suggest that explicit cognitive alignment offers a scalable pathway to reliable and safe psychiatric decision support.
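To make the training setup concrete, the sketch below illustrates the general shape of reward-model-guided policy optimization that the abstract describes: a policy (the light LLM) samples responses, an independently trained reward model scores them, and the policy is updated toward higher-reward reasoning. This is a minimal, hypothetical illustration with toy stand-in models and a simple REINFORCE-style objective; the actual ClinMPO objective, reward model, and data pipeline are specified in the paper body, and all names here (`ToyPolicy`, `ToyRewardModel`) are illustrative assumptions.

```python
# Hypothetical sketch of reward-model-guided policy optimization.
# Toy models stand in for the light-parameter LLM and the evidence-based
# reward model; the real ClinMPO objective may differ from this
# REINFORCE-style update.
import torch
import torch.nn as nn

VOCAB, HIDDEN, SEQ_LEN = 100, 32, 16

class ToyPolicy(nn.Module):
    """Stand-in for the light-parameter LLM being aligned."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.head = nn.Linear(HIDDEN, VOCAB)
    def forward(self, tokens):
        return self.head(self.embed(tokens))  # (batch, seq, vocab) logits

class ToyRewardModel(nn.Module):
    """Stand-in for the independently trained, evidence-grounded reward model."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.score = nn.Linear(HIDDEN, 1)
    @torch.no_grad()  # reward model is frozen during policy optimization
    def forward(self, tokens):
        return self.score(self.embed(tokens).mean(dim=1)).squeeze(-1)  # (batch,)

policy, reward_model = ToyPolicy(), ToyRewardModel()
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-4)

prompts = torch.randint(0, VOCAB, (4, SEQ_LEN))   # dummy clinical-case prompts
dist = torch.distributions.Categorical(logits=policy(prompts))
responses = dist.sample()                          # sampled reasoning tokens
log_probs = dist.log_prob(responses).sum(dim=1)    # sequence log-likelihood

rewards = reward_model(responses)                  # scalar quality scores
advantages = rewards - rewards.mean()              # simple baseline subtraction

loss = -(advantages.detach() * log_probs).mean()   # push policy toward high reward
optimizer.zero_grad()
loss.backward()
optimizer.step()
```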