Depression Detection and Analysis using Large Language Models on Textual and Audio-Visual Modalities

Depression has proven to be a significant public health issue, profoundly affecting the psychological well-being of individuals. If it remains undiagnosed, depression can lead to severe health issues, which can manifest physically and even lead to suicide. Diagnosing depression or any other mental disorder generally involves semi-structured interviews conducted by clinicians and mental health professionals, alongside supplementary questionnaires such as variants of the Patient Health Questionnaire (PHQ). This approach relies heavily on the experience and judgment of trained physicians, making the diagnosis susceptible to personal bias. Because the underlying mechanisms of depression are still being actively researched, physicians often face challenges in diagnosing and treating the condition, particularly in its early stages of clinical presentation. Recently, significant strides have been made in artificial neural computing on problems involving text, image, and speech across various domains. Our analysis aims to leverage these state-of-the-art (SOTA) models across multiple modalities to achieve the best possible outcomes. The experiments were performed on the Extended Distress Analysis Interview Corpus Wizard of Oz (E-DAIC) dataset presented in the Audio/Visual Emotion Challenge (AVEC) 2019. Using proprietary and open-source Large Language Models (LLMs), the proposed solutions achieved a Root Mean Square Error (RMSE) of 3.98 on the textual modality, beating the AVEC 2019 challenge baseline and current SOTA regression architectures. Additionally, the proposed solution achieved an accuracy of 71.43% in the classification task. The paper also presents a novel audio-visual multi-modal network that predicts PHQ-8 scores with an RMSE of 6.51.
