This study examines how different modalities of human communication can be used to distinguish healthy controls from subjects with schizophrenia who exhibit strong positive symptoms. We developed a multi-modal schizophrenia classification system that uses audio, video, and text. Facial action units and vocal tract variables were extracted as low-level features from video and audio, respectively, and were then used to compute high-level coordination features that served as the inputs to the video and audio modalities. Context-independent text embeddings extracted from speech transcriptions served as the input to the text modality. The multi-modal system fuses a segment-to-session-level classifier for the video and audio modalities with a text model based on a Hierarchical Attention Network (HAN) with cross-modal attention. The proposed system outperforms the previous state-of-the-art multi-modal system by 8.53% in weighted average F1 score.
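To make the reported metric concrete, the following is a minimal sketch of a weighted average F1 score, assuming the common definition in which each class's F1 is weighted by its share of the true labels. The abstract does not state which definition was used, and the function name `weighted_f1` and the labels `"hc"` and `"sz"` are hypothetical, chosen only for illustration.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (assumed definition)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives: samples of this class that were predicted as this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Weight each class's F1 by its share of the true labels.
        score += (n_true / total) * f1
    return score


# Hypothetical session-level labels: healthy control vs. schizophrenia.
y_true = ["hc", "hc", "hc", "sz", "sz"]
y_pred = ["hc", "hc", "sz", "sz", "sz"]
result = weighted_f1(y_true, y_pred)
```

Because the weights follow class support, a majority class contributes more to the score than a minority class, which matters when the control and patient groups differ in size.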