As demand for mental health care outpaces clinician-delivered assessment, scalable screening tools are increasingly needed. Large language models (LLMs) may identify psychiatric risk from patient narratives, but their reliability across diagnoses, demographic subgroups, and evidence-use patterns remains uncertain. We introduce a SCID-anchored benchmark of 555 semi-structured experiential interviews paired with diagnostic reference labels for anxiety disorder, major depressive disorder, post-traumatic stress disorder (PTSD), and any current mental health disorder. Using zero-shot task-specific prompting, we evaluated five state-of-the-art LLMs and examined whether false-negative errors reflected missed psychiatric evidence or differential weighting of symptom, functional-impairment, and protective-context cues. Performance varied across tasks and models, with accuracy ranging from 0.49 to 0.86 and Matthews correlation coefficients from 0.16 to 0.38. GPT-4.1 Mini and GPT-5 Mini showed the most consistent disorder-specific accuracy. Subgroup analyses found higher depression-classification accuracy among male than among female participants, no consistent age-related pattern, and modest, non-uniform variation across race strata. Evidence-integration analyses showed that narratives behind false-negative anxiety and PTSD classifications often contained explicit symptom evidence accompanied by preserved functioning, coping ability, or social support. Functional-impairment evidence shifted model outputs toward positive classifications, whereas protective-context evidence shifted them away from positive classifications. These findings suggest that LLMs may support scalable psychiatric screening, but their tendency to discount symptom evidence in the presence of preserved functioning or protective context requires careful validation before clinical deployment.
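The abstract reports accuracy alongside the Matthews correlation coefficient (MCC), and the two can diverge sharply when positive cases are rare. The following is a minimal sketch of how both metrics are computed from a binary confusion matrix. The function name `accuracy_and_mcc` and the counts passed to it are hypothetical illustrations, not data or code from the study.

```python
import math


def accuracy_and_mcc(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """Return (accuracy, MCC) for a binary confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total

    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc


# Hypothetical counts for an imbalanced screening task (NOT from the study):
# 50 true cases out of 500, of which the classifier misses 40.
acc, mcc = accuracy_and_mcc(tp=10, fp=20, fn=40, tn=430)
print(f"accuracy={acc:.2f}, MCC={mcc:.2f}")
```

With these made-up counts, most of the accuracy comes from correctly labelled negatives, while the many false negatives keep the MCC low. This is one reason a screening benchmark might report both metrics rather than accuracy alone.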