Depression, a prominent contributor to global disability, affects a substantial portion of the population. Detecting depression from social media text has been widely studied, yet only a few works have explored depression detection from user-generated video content. In this work, we address this research gap by proposing a simple and flexible multi-modal temporal model capable of discerning non-verbal depression cues from diverse modalities in noisy, real-world videos. We show that, for in-the-wild videos, using additional high-level non-verbal cues is crucial to achieving good performance; to this end, we extract and process audio speech embeddings, face emotion embeddings, face, body, and hand landmarks, and gaze and blinking information. Through extensive experiments, we show that our model achieves state-of-the-art results by a substantial margin on three key benchmark datasets for depression detection from video. Our code is publicly available on GitHub.