This paper audits benchmark evaluation in clinical-interview depression detection through four complementary probes across DAIC/E-DAIC, CMDC, ANDROIDS, MODMA, and PDCH. First, we re-evaluate E-DAIC under strict subject-disjoint leave-one-subject-out cross-validation. A lightweight hybrid text-plus-LLM-score model reaches macro-F1 = 0.723, to our knowledge the highest reported under this protocol, and provides a conservative out-of-fold reference point that does not depend on the privileged official holdout. Second, we test whether the E-DAIC official split supports fine-grained leaderboard rankings by sweeping 96 model configurations across modality bundles, pooling strategies, and learners. Development-side cross-validation rankings and official-test rankings align only moderately: the best cross-validation configuration ranks twentieth on the official test, the official-test winner ranks forty-first by cross-validation, the two top-3 sets do not overlap, and the apparent winner holds rank 1 in only 32.3% of subject-level bootstrap resamples. Third, we externally validate strong public CMDC and ANDROIDS baselines that achieve near-ceiling in-domain performance; their zero-shot transfer to external corpora is substantially weaker. Finally, we stress-test E-DAIC text and audio models on paired symptom-dense and symptom-light interview slices defined by an SRDS-based annotator. Text scores rise sharply on symptom-dense slices, whereas audio scores remain nearly flat; the text-minus-audio gap is positive across all five seeds.
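The rank-stability figure (how often the apparent winner holds rank 1 across subject bootstraps) can be illustrated with a minimal sketch. It assumes binary labels, one hard prediction per subject per configuration, and macro-F1 as the ranking metric; the function names `macro_f1` and `rank1_frequency`, the default of 1000 resamples, and the tie-breaking rule are hypothetical choices for illustration, not the paper's stated procedure.

```python
import random


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 for binary labels 0/1."""
    scores = []
    for cls in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted members contributes F1 = 0 (assumption).
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def rank1_frequency(y_true, preds_by_config, n_boot=1000, seed=0):
    """Fraction of subject-level bootstrap resamples in which each config ranks first.

    y_true:          list of 0/1 labels, one per subject
    preds_by_config: dict mapping config name -> list of 0/1 predictions
    """
    rng = random.Random(seed)
    n = len(y_true)
    names = list(preds_by_config)
    wins = {name: 0 for name in names}
    for _ in range(n_boot):
        # Resample subjects with replacement; every config sees the same resample.
        idx = [rng.randrange(n) for _ in range(n)]
        resampled_truth = [y_true[i] for i in idx]
        scores = {
            name: macro_f1(resampled_truth, [preds_by_config[name][i] for i in idx])
            for name in names
        }
        # Ties go to the config listed first (an arbitrary choice for this sketch).
        best = max(names, key=lambda name: scores[name])
        wins[best] += 1
    return {name: wins[name] / n_boot for name in names}
```

Calling `rank1_frequency(labels, {"config_a": preds_a, "config_b": preds_b})` returns one win fraction per configuration. A leading fraction well below 1 indicates that the top rank depends on which subjects happen to fall in the test set.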