Automatic depression detection from doctor-patient conversations has gained momentum thanks to the availability of public corpora and advances in language modeling. However, interpretability remains limited: strong performance is often reported without revealing what drives predictions. We analyze three datasets (ANDROIDS, DAIC-WOZ, and E-DAIC) and identify a systematic bias introduced by interviewer prompts in semi-structured interviews. Models trained on interviewer turns exploit fixed prompts and their positions to distinguish depressed from control subjects, often achieving high classification scores without using participant language at all. Restricting models to participant utterances distributes decision evidence more broadly and reflects genuine linguistic cues. While semi-structured protocols ensure consistency, including interviewer prompts inflates performance because models can exploit script artifacts. Our results reveal a cross-dataset, architecture-agnostic bias and underscore the need for analyses that localize decision evidence by time and speaker, so that models can be shown to learn from participants' language.