Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Through extensive experiments across state-of-the-art LLMs, we establish three key findings: (1) although LLMs achieve high accuracy on binary depression-anxiety classification (up to 92.3%), performance deteriorates substantially for depression-anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions. We release LingxiDiag-16K and the full evaluation framework to support reproducible research at https://github.com/Lingxi-mental-health/LingxiDiagBench.