Large language models (LLMs) excel on many NLP benchmarks, but their behavior on real-world, semi-structured prediction remains underexplored. We present LlaMADRS, a benchmark for structured clinical assessment from dialogue built on the CAMI corpus of psychiatric interviews, comprising 5,804 expert annotations across 541 sessions. We evaluate 25 open-source models (standard and reasoning-augmented; 0.6B to 400B parameters) and generate over 400,000 predictions. Our results demonstrate that strong open-source LLMs achieve item-level accuracy with residual error below clinically substantial thresholds. Additionally, an Item-then-Sum (ItS) strategy, which assesses each symptom individually through a separate LLM call before summing the item scores into a final score, significantly reduces error relative to Direct Total Score (DTS) prediction across most model architectures and scales, even though reasoning models attempt a similar decomposition in the reasoning traces of their DTS predictions. In fact, we find that performance gains attributed to "reasoning" depend fundamentally on prompt design: standard models equipped with structured task definitions and examples match their reasoning-augmented counterparts. Among reasoning-augmented models, longer reasoning traces correlate with lower error, and larger model scale correlates with lower error for both standard and reasoning-augmented models. Our results clarify when and why reasoning helps and offer actionable guidance for deploying LLMs in semi-structured clinical assessment.
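The contrast between ItS and DTS can be illustrated with a minimal sketch. It assumes the benchmark targets the standard ten-item MADRS scale (each item rated 0 to 6, total 0 to 60), as the name LlaMADRS suggests. The callables `rate_item` and `rate_total` are hypothetical stand-ins for LLM calls, and the function names are illustrative rather than taken from the benchmark's code.

```python
from typing import Callable, Dict, Tuple

# Assumption: the ten standard MADRS items, each rated on a 0-6 scale.
MADRS_ITEMS = [
    "apparent sadness", "reported sadness", "inner tension", "reduced sleep",
    "reduced appetite", "concentration difficulties", "lassitude",
    "inability to feel", "pessimistic thoughts", "suicidal thoughts",
]

# Hypothetical LLM wrappers: (transcript, item) -> score, and transcript -> total.
ItemRater = Callable[[str, str], int]
TotalRater = Callable[[str], int]


def clamp(value: int, low: int, high: int) -> int:
    """Keep a model's raw score inside the valid range of the scale."""
    return max(low, min(high, value))


def item_then_sum(transcript: str, rate_item: ItemRater) -> Tuple[int, Dict[str, int]]:
    """ItS: one separate call per item, then sum the item scores."""
    scores = {
        item: clamp(rate_item(transcript, item), 0, 6)
        for item in MADRS_ITEMS
    }
    return sum(scores.values()), scores


def direct_total(transcript: str, rate_total: TotalRater) -> int:
    """DTS: a single call that predicts the total score directly."""
    return clamp(rate_total(transcript), 0, 60)
```

With a real model, `rate_item` would receive the interview transcript plus the definition of a single item, so each call focuses on one symptom; `rate_total` would receive the transcript once and return the overall score.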