Large language models (LLMs) are increasingly used in a zero-shot fashion to assess mental health conditions, yet little is known about what factors affect their accuracy. In this study, we use a clinical dataset of natural-language narratives and self-reported PTSD severity scores from 1,437 individuals to comprehensively evaluate 11 state-of-the-art LLMs. To understand the factors affecting accuracy, we systematically varied (i) contextual knowledge, such as subscale definitions, distribution summaries, and interview questions, and (ii) modeling strategies, including zero-shot vs. few-shot prompting, amount of reasoning effort, model size, structured subscale vs. direct scalar prediction, output rescaling, and nine ensemble methods. Our findings indicate that (a) LLMs are most accurate when provided with detailed construct definitions and the context of the narrative; (b) increased reasoning effort yields better estimation accuracy; (c) performance of open-weight models (Llama, DeepSeek) plateaus beyond 70B parameters, while closed-weight models (o3-mini, gpt-5) improve with newer generations; and (d) the best performance is achieved by ensembling a supervised model with the zero-shot LLMs. Taken together, these results suggest that the choice of contextual knowledge and modeling strategy is important for deploying LLMs to accurately assess mental health.
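As a minimal illustration of the ensembling idea described above, the sketch below combines a supervised model's severity predictions with zero-shot LLM estimates via a simple linear stacking combiner fit on a validation set. All data here is simulated and every variable name is hypothetical; the paper evaluates nine ensemble methods, and this shows only one generic approach, not the authors' exact procedure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical validation data: true severity scores (e.g., a 0-80 scale)
# and two prediction sources with different noise levels.
y_val = rng.uniform(0, 80, size=200)            # ground-truth severity
llm_pred = y_val + rng.normal(0, 12, size=200)  # noisier zero-shot LLM estimates
sup_pred = y_val + rng.normal(0, 8, size=200)   # supervised-model estimates

# Stacking: fit a linear combiner on the validation-set predictions.
X_val = np.column_stack([llm_pred, sup_pred])
stack = LinearRegression().fit(X_val, y_val)

# Apply the combiner to a new narrative's two estimates.
new_scores = np.column_stack([[50.0], [55.0]])  # [LLM estimate, supervised estimate]
ensemble_score = stack.predict(new_scores)
```

Because each input predictor lies in the combiner's hypothesis space, the in-sample mean squared error of the stacked output can be no worse than that of either source alone.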