Speech-based depression detection (SDD) has emerged as a non-invasive and scalable alternative to conventional clinical assessments. However, existing methods still struggle to capture robust depression-related speech characteristics, which are sparse and heterogeneous. Although pretrained self-supervised learning (SSL) models provide rich representations, most recent SDD studies feed the downstream classifier features extracted from a single layer of the pretrained SSL model. This practice overlooks the complementary roles of the low-level acoustic features and high-level semantic information inherently encoded in different SSL layers. To explicitly model interactions between acoustic and semantic representations within an utterance, we propose HAREN-CTC, a hierarchical adaptive representation encoder with prior knowledge that disentangles and re-aligns acoustic and semantic information through asymmetric cross-attention, enabling fine-grained acoustic patterns to be interpreted in semantic context. In addition, a Connectionist Temporal Classification (CTC) objective is applied as auxiliary supervision to handle the irregular temporal distribution of depressive characteristics without requiring frame-level annotations. Experiments on DAIC-WOZ and MODMA demonstrate that HAREN-CTC consistently outperforms existing methods under both performance upper-bound and generalization evaluation settings, achieving Macro F1 scores of 0.81 and 0.82, respectively, in upper-bound evaluation, and maintaining superior performance with statistically significant improvements in precision and AUC under rigorous cross-validation. These findings suggest that modeling hierarchical acoustic-semantic interactions better reflects how depressive characteristics manifest in natural speech, enabling scalable and objective depression assessment.
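As a rough illustration only (not the paper's implementation), the asymmetric cross-attention idea above can be sketched as a single-head, parameter-free attention pass in NumPy. The feature shapes, the choice of which SSL layers supply the "acoustic" and "semantic" streams, and the query/key direction (semantic queries attending over acoustic keys/values) are all assumptions for this toy example:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d_k):
    # scaled dot-product attention: each query frame attends over all
    # key/value frames of the other stream
    scores = queries @ keys_values.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ keys_values

# toy frame-level features; hypothetical stand-ins for SSL layer outputs
rng = np.random.default_rng(0)
T, d = 50, 16
acoustic = rng.standard_normal((T, d))   # e.g. an early (low-level) SSL layer
semantic = rng.standard_normal((T, d))   # e.g. a late (high-level) SSL layer

# asymmetric direction (an assumption): semantic context queries the
# fine-grained acoustic stream, re-aligning acoustic detail in context
fused = cross_attention(semantic, acoustic, d)
print(fused.shape)  # (50, 16)
```

In a trained model, the queries, keys, and values would each pass through learned projections, and the CTC head would sit on top of the fused sequence; this sketch only shows the attention direction.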