When Symptoms Are Not Enough: Evidence-Weighting Patterns in Large Language Model Psychiatric Screening

As demand for mental health care outpaces clinician-delivered assessment, scalable screening tools are increasingly needed. Large language models (LLMs) may identify psychiatric risk from patient narratives, but their reliability across diagnoses, demographic subgroups, and evidence-use patterns remains uncertain. We introduce a SCID-anchored benchmark of 555 semi-structured experiential interviews paired with diagnostic reference labels for anxiety disorder, major depressive disorder, post-traumatic stress disorder (PTSD), and any current mental health disorder. Using zero-shot task-specific prompting, we evaluated five state-of-the-art LLMs and examined whether false-negative errors reflected missed psychiatric evidence or differential weighting of symptom, functional-impairment, and protective-context cues. Performance varied across tasks and models, with accuracy ranging from 0.49 to 0.86 and Matthews correlation coefficients from 0.16 to 0.38. GPT-4.1 Mini and GPT-5 Mini showed the most consistent disorder-specific accuracy. Subgroup analyses found higher depression-classification accuracy among male than female participants, no consistent age-related pattern, and modest non-uniform variation across race strata. Evidence-integration analyses showed that false-negative anxiety and PTSD classifications often contained explicit symptom evidence but were accompanied by preserved functioning, coping ability, or social support. Functional-impairment evidence shifted model outputs toward positive classifications, whereas protective-context evidence shifted them away from positive classifications. These findings suggest that LLMs may support scalable psychiatric screening, but their tendency to discount symptom evidence in the presence of preserved functioning or protective context requires careful validation before clinical deployment.
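The abstract reports accuracy alongside the Matthews correlation coefficient (MCC), and the two can diverge when positive cases are a minority. The following is a minimal sketch of how both are computed from a binary confusion matrix. The counts are hypothetical and chosen only for illustration; they are not taken from the benchmark, and the function names are assumptions of this sketch.

```python
import math


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all cases that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts (NOT from the paper): 1000 interviews, 180 true positives
# in the reference labels, and a classifier that misses most of them.
tp, fp, tn, fn = 40, 60, 760, 140

acc_value = accuracy(tp, fp, tn, fn)
mcc_value = mcc(tp, fp, tn, fn)

print(f"accuracy = {acc_value:.2f}")
print(f"MCC      = {mcc_value:.2f}")
```

In this made-up case most correct predictions are true negatives, so accuracy looks high while MCC stays low because many positive cases are missed. That is the kind of gap a false-negative analysis is meant to examine.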

