Current approaches to detecting depression and anxiety from speech rely primarily on machine learning models built on hand-engineered paralinguistic features and related acoustic descriptors derived from time- and frequency-domain representations of the speech signal. Applying deep learning directly to raw speech has the potential to produce biomarker representations with substantially greater predictive power. However, such approaches typically require large volumes of carefully annotated data to learn robust and clinically meaningful representations. In this paper, we describe our efforts toward developing a deep learning model trained on a large-scale proprietary dataset of ~65,000 utterances collected from more than 23,000 subjects representative of relevant United States demographics. We present the techniques employed and analyze their impact on model performance. Our results show that the proposed models extract content-agnostic biomarker information which, when combined with lexical features extracted from audio, improves predictive performance in production settings. Evaluated on ~5,000 unique subjects, our models achieve 71% sensitivity and specificity. To foster further research in mental health assessment from speech, we release the best-performing model described in this paper on HuggingFace.
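For readers less familiar with the reported metrics, the following is a minimal sketch of how sensitivity and specificity are computed from binary labels and predictions. The function name `sensitivity_specificity` and the toy labels are hypothetical illustrations only; they are not the paper's evaluation code or data.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 = condition present.

    Hypothetical helper for illustration; not taken from the paper.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives

    # Sensitivity: fraction of positive subjects correctly flagged.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of negative subjects correctly cleared.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy example (made-up labels): 4 positive and 4 negative subjects.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Reporting both numbers together matters for screening tasks: either one can be raised in isolation by shifting the decision threshold, at the cost of the other.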