Dep-LLM: Training-Free Depression Diagnosis via Evidence-Guided Structured Multi-factor with Reliable LLM Reasoning

Automatic Depression Detection (ADD) from clinical interviews is a pivotal task in computational mental health, yet it remains challenging because of two obstacles: 1) depression cues are complex but sparsely distributed across lengthy, multi-topic clinical interviews and are difficult to model, which leads to superficial and unreliable reasoning; 2) labeled data are scarce because of clinical privacy constraints, and training and fine-tuning are costly, which limits the deployment of supervised ADD systems. To address both challenges jointly, we propose Dep-LLM, a training-free framework that mirrors the step-by-step reasoning of clinical psychiatrists and operates entirely on frozen, off-the-shelf foundation LLMs. Dep-LLM comprises three stages. First, a Chain-of-Thought (CoT) Depression Multi-factor Analysis module decomposes the long dialogue into five clinically aligned themes and produces an evidence-grounded rationale for each, which handles long-context dependencies. Second, a Confidence Analysis and Modulation module quantifies the epistemic reliability of each rationale from its token-level entropy and applies intra-label and inter-theme modulation that amplifies trustworthy signals and suppresses uncertain ones without extra training. Third, a Collaborative Multi-factor Prediction module dynamically integrates the confidence-weighted multi-factor signals into the final diagnosis. Extensive experiments on the DAIC-WOZ and E-DAIC datasets demonstrate the effectiveness and generalizability of Dep-LLM: it surpasses the zero-shot baseline on nearly all of 21 foundation LLMs across 9 metrics, including accuracy, macro F1, and weighted-average F1, and further outperforms state-of-the-art supervised domain-specific LLMs as well as the latest closed-source commercial LLMs, all without extra training.
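
The second and third stages can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the modulation formulas, so the normalized-entropy confidence, the softmax weighting with a `temperature` parameter, and all function names here are assumptions chosen for illustration. It also assumes that a (possibly renormalized top-k) probability distribution is available for every generated token of each theme's rationale.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def rationale_confidence(token_dists):
    """Confidence in [0, 1]: one minus the mean normalized token entropy.

    Assumption: each entry of token_dists is the probability distribution
    the LLM assigned at one generation step of the rationale.
    """
    normalized = []
    for probs in token_dists:
        max_h = math.log(len(probs))  # entropy of a uniform distribution
        normalized.append(token_entropy(probs) / max_h if max_h > 0 else 0.0)
    return 1.0 - sum(normalized) / len(normalized)

def softmax(values, temperature=1.0):
    """Numerically stable softmax; lower temperature sharpens the weights."""
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def predict(theme_labels, theme_confidences, temperature=0.5):
    """Confidence-weighted vote over per-theme labels (1 = depressed, 0 = not).

    The softmax is a hypothetical stand-in for the paper's modulation step:
    it amplifies high-confidence themes and suppresses low-confidence ones.
    """
    weights = softmax(theme_confidences, temperature)
    score = sum(w * y for w, y in zip(weights, theme_labels))
    return (1 if score >= 0.5 else 0), score

# Usage: three themes, only the first of which has a confident rationale.
label, score = predict([1, 0, 0], [0.9, 0.1, 0.1])
```

In this example the single confident theme outweighs the two uncertain ones, whereas an unweighted majority vote would follow the two uncertain themes. That contrast is the behavior the abstract attributes to confidence modulation.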
