When Symptoms Are Not Enough: Evidence-Weighting Patterns in Large Language Model Psychiatric Screening

As demand for mental health care outpaces the capacity for clinician-delivered assessment, scalable screening tools are increasingly needed. Large language models (LLMs) may identify psychiatric risk from patient narratives, but their reliability across diagnoses and demographic subgroups, and the way they weigh evidence, remain uncertain. We introduce a SCID-anchored benchmark of 555 semi-structured experiential interviews paired with diagnostic reference labels for anxiety disorder, major depressive disorder, post-traumatic stress disorder (PTSD), and any current mental health disorder. Using zero-shot, task-specific prompting, we evaluated five state-of-the-art LLMs and examined whether false-negative errors reflected missed psychiatric evidence or differential weighting of symptom, functional-impairment, and protective-context cues. Performance varied across tasks and models, with accuracy ranging from 0.49 to 0.86 and Matthews correlation coefficients from 0.16 to 0.38. GPT-4.1 Mini and GPT-5 Mini showed the most consistent disorder-specific accuracy. Subgroup analyses found higher depression-classification accuracy for male than for female participants, no consistent age-related pattern, and modest, non-uniform variation across race strata. Evidence-integration analyses showed that narratives receiving false-negative anxiety and PTSD classifications often contained explicit symptom evidence alongside descriptions of preserved functioning, coping ability, or social support. Functional-impairment evidence shifted model outputs toward positive classifications, whereas protective-context evidence shifted them away from positive classifications. These findings suggest that LLMs may support scalable psychiatric screening, but their tendency to discount symptom evidence in the presence of preserved functioning or protective context requires careful validation before clinical deployment.
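For readers less familiar with the two metrics reported above, the following minimal sketch computes accuracy and the Matthews correlation coefficient (MCC) from a binary confusion matrix. The function names and the counts are hypothetical and chosen only for illustration; they are not results from this study. The example shows how a classifier that misses most positive cases can still reach high accuracy while its MCC stays modest.

```python
import math


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all cases classified correctly."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical screening outcome for 100 interviews, 20 of them true cases.
# The classifier detects only 5 of the 20 cases (15 false negatives).
tp, fp, fn, tn = 5, 5, 15, 75

print(f"accuracy = {accuracy(tp, fp, fn, tn):.2f}")
print(f"MCC      = {mcc(tp, fp, fn, tn):.2f}")
```

Because MCC uses all four cells of the confusion matrix, it penalizes the many false negatives in this hypothetical case even though most of the 100 interviews are labeled correctly.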
