Depression is a leading cause of disability worldwide, and early detection of symptom change is essential for timely intervention. Validated instruments such as the Patient Health Questionnaire-9 (PHQ-9) support symptom monitoring at scale, but real-world completion rates are low, which introduces response bias and systematic missingness. Passive approaches that infer severity from routinely generated data could close this gap. We pursue this approach by predicting PHQ-9 total scores directly from transcripts of conversations between users and an AI mental health application, using only conversation text and no additional clinical data. We fine-tune a Qwen3.5-27B backbone with a regression head and augment 3,111 ground-truth labels with pseudolabels generated by a reasoning model (Claude Opus) and by iteratively trained intermediate models, yielding a combined dataset of 6,283 users. On a held-out test set of 842 users, our best model achieves MAE = 2.6, RMSE = 4.0, and Pearson r = 0.80, with AUC = 0.91 at the PHQ-9 >= 10 clinical threshold. AUC also exceeds 0.87 at every severity threshold from PHQ-9 >= 3 to PHQ-9 >= 24, indicating that the model discriminates depression severity across the full clinical spectrum. This work opens the door to passive, continuous symptom monitoring in AI mental health platforms, without requiring users to complete self-report measures.
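For concreteness, the reported evaluation metrics can be computed from paired true and predicted PHQ-9 totals roughly as follows. This is a minimal sketch and not the evaluation code used in this work: the function names `regression_metrics` and `auc_at_threshold` are hypothetical, and it assumes the AUC at a threshold is obtained by binarizing the true score (PHQ-9 >= threshold) and ranking users by the continuous predicted score.

```python
import numpy as np


def auc_at_threshold(y_true, y_pred, threshold):
    """AUC for separating users whose true PHQ-9 total is >= threshold,
    using the continuous predicted score as the ranking variable.
    (Hypothetical helper; assumes this is how the threshold AUC is defined.)"""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    pos = y_pred[y_true >= threshold]  # predictions for true positives
    neg = y_pred[y_true < threshold]   # predictions for true negatives
    if pos.size == 0 or neg.size == 0:
        return float("nan")  # AUC is undefined when one class is empty
    # Mann-Whitney formulation: share of (positive, negative) pairs that are
    # ranked correctly, with tied pairs counted as half.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


def regression_metrics(y_true, y_pred, threshold=10):
    """MAE, RMSE, Pearson r, and AUC at one clinical threshold."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    return {
        "mae": float(np.mean(np.abs(err))),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "pearson_r": float(np.corrcoef(y_true, y_pred)[0, 1]),
        "auc": auc_at_threshold(y_true, y_pred, threshold),
    }


if __name__ == "__main__":
    # Toy values for illustration only; these are not data from the study.
    toy_true = [0, 5, 10, 20]
    toy_pred = [1.0, 4.0, 12.0, 18.0]
    print(regression_metrics(toy_true, toy_pred, threshold=10))
```

Sweeping `threshold` over a range of cutoffs with `auc_at_threshold` gives a per-threshold AUC profile of the kind summarized above. The pairwise comparison is quadratic in the number of users, which is acceptable for a test set of a few hundred users; a rank-based implementation would be preferable at larger scale.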