Depression is the most common psychological disorder and is considered a leading cause of disability and suicide worldwide. An automated system capable of detecting signs of depression in human speech can help ensure timely and effective mental health care for individuals suffering from the disorder. Developing such an automated system requires accurate machine learning models capable of capturing signs of depression. However, state-of-the-art models based on deep acoustic representations require abundant data, meticulous feature selection, and rigorous training, a procedure that demands enormous computational resources. In this work, we explore the effectiveness of two acoustic feature groups, conventional hand-curated features and deep representation features, for predicting the severity of depression from speech. We also examine the relevance of factors that may contribute to the models' performance, including the gender of the speaker, the severity of the disorder, and the content and length of speech. Our findings suggest that models trained on conventional acoustic features perform as well as or better than those trained on deep representation features, at significantly lower computational cost, irrespective of other factors such as the content and length of speech, the gender of the speaker, and the severity of the disorder. This makes such models a better fit for deployment where computational resources are limited, such as real-time depression monitoring applications on smart devices.