We introduce MentalBench, a benchmark for evaluating psychiatric diagnostic decision-making in large language models (LLMs). Existing mental health benchmarks rely largely on social media data, limiting their ability to assess DSM-grounded diagnostic judgments. At the core of MentalBench is MentalKG, a knowledge graph built and validated by psychiatrists that encodes DSM-5 diagnostic criteria and differential-diagnosis rules for 23 psychiatric disorders. Using MentalKG as a gold-standard logical backbone, we generate 24,750 synthetic clinical cases that systematically vary in information completeness and diagnostic complexity, enabling low-noise, interpretable evaluation. Our experiments show that while state-of-the-art LLMs perform well on structured queries probing DSM-5 knowledge, they struggle to calibrate confidence in diagnostic decisions when distinguishing between clinically overlapping disorders. These findings reveal evaluation gaps not captured by existing benchmarks.