A Systematic Evaluation of Large Language Models for PTSD Severity Estimation: The Role of Contextual Knowledge and Modeling Strategies

Panagiotis Kaliosis, Adithya V Ganesan, Oscar N. E. Kjell, Whitney Ringwald, Scott Feltman, Melissa A. Carr, Dimitris Samaras, Camilo Ruggero, Benjamin J. Luft, Roman Kotov, Andrew H. Schwartz

From arXiv, 24 pages, 5 figures, 5 tables

Large language models (LLMs) are increasingly used in a zero-shot (generative) fashion to assess mental health conditions, yet we have limited knowledge of the factors that affect their accuracy. In this study, we use a clinical dataset of natural language narratives and self-reported PTSD severity scores from 1,437 individuals to comprehensively evaluate the performance of 11 state-of-the-art LLMs. To understand the factors affecting the models' assessment accuracy, we systematically varied (i) the contextual knowledge provided to the models in the prompt, such as subscale definitions, distribution summary, and interview questions, and (ii) modeling strategies, including zero-shot vs. few-shot prompting, amount of reasoning effort, model size, structured subscale prediction vs. direct scalar prediction, output rescaling, and nine ensemble methods. Our findings indicate that (a) LLMs are most accurate when provided with detailed construct definitions and the context of the narrative, even exceeding human raters' agreement with self-reported scores; (b) increased reasoning effort leads to better estimation accuracy; (c) the performance of open-weight models (Llama, DeepSeek) plateaus beyond 70B parameters, while closed-weight alternatives (gpt-o3-mini, gpt-5) improve with newer generations; and (d) the best performance is achieved when ensembling a supervised model with the zero-shot LLMs. Beyond agreement with self-reports, LLMs' estimates discriminated PTSD severity from depression, anxiety, and alcohol use, and prospectively predicted future mental healthcare expenditure. Together, these results suggest that contextual knowledge and modeling strategies meaningfully affect the accuracy and clinical utility of LLM-based assessments of PTSD severity.
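
As a rough illustration of two of the modeling strategies named above, output rescaling and ensembling a supervised model with zero-shot LLM estimates, the following minimal sketch may help. It is not the authors' implementation: the linear mean/standard-deviation rescaling, the unweighted average, the function names, and all numbers are assumptions chosen for illustration only, and the abstract does not specify which of the nine ensemble methods or which rescaling procedure was used.

```python
from statistics import mean, pstdev


def rescale(scores, target_mean, target_std):
    """Linearly map scores so their mean and std match a target distribution.

    Hypothetical rescaling rule; the abstract does not say which one was used.
    """
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        # All scores identical: fall back to the target mean.
        return [float(target_mean) for _ in scores]
    return [target_mean + (s - mu) / sigma * target_std for s in scores]


def average_ensemble(*prediction_lists):
    """Unweighted mean of per-person predictions from several models."""
    return [mean(preds) for preds in zip(*prediction_lists)]


# Made-up example values (not data from the study).
llm_raw = [2.0, 4.0, 6.0]          # zero-shot LLM estimates on an arbitrary scale
supervised = [10.0, 20.0, 30.0]    # supervised model estimates on the target scale

# Assumed summary of the target score distribution.
llm_rescaled = rescale(llm_raw, target_mean=20.0, target_std=10.0)
ensemble = average_ensemble(llm_rescaled, supervised)
print(ensemble)
```

The rescaling step only changes the location and spread of the LLM estimates, so their rank order is preserved; the averaging step is the simplest possible ensemble and stands in for whichever combination rule is actually preferred.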
