Mental health disorders represent a burgeoning global public health challenge. While Large Language Models (LLMs) have demonstrated potential in psychiatric assessment, their clinical utility is severely constrained by benchmarks that lack ecological validity and fine-grained diagnostic supervision. To bridge this gap, we introduce \textbf{MentalDx Bench}, the first benchmark dedicated to disorder-level psychiatric diagnosis in real-world clinical settings. Comprising 712 de-identified electronic health records annotated by board-certified psychiatrists under ICD-11 guidelines, the benchmark covers 76 disorders across 16 diagnostic categories. Evaluation of 18 LLMs reveals a critical \textit{paradigm misalignment}: strong performance on coarse diagnostic categorization contrasts with systematic failure at disorder-level diagnosis, underscoring a gap between pattern-based modeling and clinical hypothetico-deductive reasoning. In response, we propose \textbf{MentalSeek-Dx}, a medical-specialized LLM trained to internalize this clinical reasoning process through supervised trajectory construction and curriculum-based reinforcement learning. Experiments on MentalDx Bench demonstrate that MentalSeek-Dx achieves state-of-the-art performance with only 14B parameters, establishing a clinically grounded framework for reliable psychiatric diagnosis.