This study examines how different modalities of human communication can be used to distinguish healthy controls from subjects with schizophrenia who exhibit strong positive symptoms. We developed a multi-modal schizophrenia classification system using audio, video, and text. Facial action units and vocal tract variables were extracted as low-level features from video and audio, respectively, and were then used to compute high-level coordination features that served as the inputs to the video and audio modalities. Context-independent text embeddings extracted from speech transcriptions were used as the input for the text modality. The multi-modal system was developed by fusing a segment-to-session-level classifier for the video and audio modalities with a text model based on a Hierarchical Attention Network (HAN) with cross-modal attention. The proposed multi-modal system outperforms the previous state-of-the-art multi-modal system by 8.53% in weighted average F1 score.
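For readers unfamiliar with the reported metric, the following is a minimal sketch of a weighted average F1 score, assuming the conventional definition in which each class's F1 is weighted by its share of the true labels. The abstract does not spell out the averaging convention beyond "weighted", and the function name `weighted_f1` and the toy labels are hypothetical illustrations, not the authors' evaluation code or data.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores.

    Assumption: "weighted" means each class's F1 is weighted by the
    fraction of true labels belonging to that class.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0

    for label, n_true in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        f1 = (2 * tp / denom) if denom else 0.0
        score += (n_true / total) * f1

    return score


if __name__ == "__main__":
    # Toy session-level labels (hypothetical): healthy control vs. schizophrenia.
    truth = ["control", "control", "control", "schizophrenia"]
    preds = ["control", "control", "schizophrenia", "schizophrenia"]
    print(f"weighted F1 = {weighted_f1(truth, preds):.4f}")
```

Because the weights follow class support, a majority class influences this score more than a minority class, which matters when comparing systems evaluated on imbalanced subject groups.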