Depression is a common mental disorder. Automatic depression detection tools that use speech, enabled by machine learning, can support early screening for depression. This paper addresses two limitations that may hinder the clinical implementation of such tools: noise resulting from segment-level labelling and a lack of model interpretability. We propose a bi-modal speech-level transformer to avoid segment-level labelling, and we introduce a hierarchical interpretation approach that provides both speech-level and sentence-level interpretations. The approach is based on gradient-weighted attention maps derived from all attention layers, which track interactions between input features. We show that the proposed model outperforms a model that learns at the segment level ($p$=0.854, $r$=0.947, $F1$=0.897 compared to $p$=0.732, $r$=0.808, $F1$=0.768). For model interpretation, using one true positive sample, we show which sentences within a given speech are most relevant to depression detection, and which text tokens and Mel-spectrogram regions within these sentences are most relevant. These interpretations allow clinicians to verify the validity of predictions made by depression detection tools, promoting their clinical implementation.