This study investigates how different modalities of human communication can be used to distinguish healthy controls from subjects with schizophrenia who exhibit strong positive symptoms. We developed a multi-modal schizophrenia classification system that uses audio, video, and text. Facial action units and vocal tract variables were extracted as low-level features from video and audio, respectively, and were then used to compute high-level coordination features that served as the inputs to the video and audio modalities. Context-independent text embeddings extracted from speech transcriptions served as the input to the text modality. The multi-modal system was built by fusing a segment-to-session-level classifier for the video and audio modalities with a text model based on a Hierarchical Attention Network (HAN) with cross-modal attention. The proposed multi-modal system outperforms the previous state-of-the-art multi-modal system by 8.53% in weighted average F1 score.
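For readers unfamiliar with the reported metric, the following is a minimal sketch of how a weighted average F1 score is commonly computed: the per-class F1 scores are averaged with weights proportional to each class's support (its number of true instances). The function name `weighted_f1` and the toy labels are illustrative assumptions and are not taken from the study's code or data.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores.

    Hypothetical helper for illustration; labels may be any hashable values.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("y_true must not be empty")

    support = Counter(y_true)  # number of true instances per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
        score += (n / total) * f1  # weight by class support
    return score


# Toy example (made-up labels): 0 = healthy control, 1 = schizophrenia
example_score = weighted_f1([0, 0, 0, 1], [0, 0, 1, 1])
```

Because each class is weighted by its support, this metric gives the majority class more influence than an unweighted (macro) average would, which matters when the control and patient groups differ in size.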