Multimodal Sentiment Analysis (MSA) leverages multiple data modalities to analyze human sentiment. Existing MSA models generally employ cutting-edge multimodal fusion and representation-learning methods to improve MSA capability. However, two key challenges remain: (i) in existing multimodal fusion methods, the decoupling of modality combinations and substantial parameter redundancy lead to insufficient fusion performance and efficiency; (ii) unimodal feature extractors and encoders face a difficult trade-off between representation capability and computational overhead. Our proposed GSIFN incorporates two main components to address these problems: (i) a graph-structured and interlaced-masked multimodal Transformer, which adopts an Interlaced Mask mechanism to construct robust multimodal graph embeddings, achieves all-modal-in-one Transformer-based fusion, and greatly reduces computational overhead; (ii) a self-supervised learning framework with low computational overhead and high performance, which uses a parallelized LSTM with matrix memory to enhance non-verbal modal features for unimodal label generation. Evaluated on the MSA datasets CMU-MOSI, CMU-MOSEI, and CH-SIMS, GSIFN achieves superior performance with significantly lower computational overhead than previous state-of-the-art models.
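To make the "all-modal-in-one" fusion idea concrete: one way to realize graph-structured attention over concatenated modality tokens is a block-structured attention mask that only permits query-key pairs along chosen modality-graph edges. The sketch below is illustrative only, with hypothetical names (`interlaced_block_mask`, the text/audio/vision edge set); it shows the general masked-attention pattern, not the paper's exact Interlaced Mask construction.

```python
import torch

def interlaced_block_mask(lengths, allowed_pairs):
    """Build a block attention mask over concatenated modality tokens.

    lengths: dict modality -> token count, e.g. {"t": 4, "a": 3, "v": 3}
    allowed_pairs: set of (query_modality, key_modality) graph edges.
    Returns an (N, N) boolean mask where True means attention is allowed.
    """
    offsets, start = {}, 0
    for mod, length in lengths.items():
        offsets[mod] = (start, start + length)
        start += length
    mask = torch.zeros(start, start, dtype=torch.bool)
    for qm, km in allowed_pairs:
        qs, qe = offsets[qm]
        ks, ke = offsets[km]
        mask[qs:qe, ks:ke] = True  # enable this modality-pair block
    return mask

# Example edge set: each modality attends to itself; text also
# attends to audio and vision (an assumed graph, for illustration).
mask = interlaced_block_mask(
    {"t": 4, "a": 3, "v": 3},
    {("t", "t"), ("a", "a"), ("v", "v"), ("t", "a"), ("t", "v")},
)
```

Such a mask can be passed to a standard Transformer self-attention layer so that a single shared parameter set serves every modality pair, which is the source of the parameter savings over maintaining a separate cross-modal Transformer per combination.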
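The "parallelized LSTM with matrix memory" matches the mLSTM of the xLSTM work (Beck et al., 2024), which admits an attention-like parallel form over the whole sequence. Below is a minimal single-head sketch of that parallel form in PyTorch, following the published stabilized formulation; the input projections and how GSIFN wires this into its self-supervised label-generation module are not shown, and the tensor names are assumptions.

```python
import torch
import torch.nn.functional as F

def mlstm_parallel(q, k, v, i_gate, f_gate):
    """Parallel (attention-like) form of the mLSTM forward pass.

    q, k, v: (B, T, d) query/key/value sequences.
    i_gate, f_gate: (B, T) input/forget gate pre-activations.
    Returns hidden states of shape (B, T, d).
    """
    B, T, d = q.shape
    log_f = F.logsigmoid(f_gate)                  # log forget gates
    log_f_cum = torch.cumsum(log_f, dim=1)        # cumulative sum over time
    # log D[t, s] = sum_{u=s+1..t} log f_u + i_s, defined for s <= t
    log_d = log_f_cum.unsqueeze(2) - log_f_cum.unsqueeze(1) + i_gate.unsqueeze(1)
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
    log_d = log_d.masked_fill(~causal, float("-inf"))
    m = log_d.max(dim=-1, keepdim=True).values    # row-wise stabilizer
    d_mat = torch.exp(log_d - m)                  # stabilized gate matrix
    c = (q @ k.transpose(-2, -1)) / (d ** 0.5) * d_mat
    n = torch.maximum(c.sum(dim=-1, keepdim=True).abs(), torch.exp(-m))
    return (c / n) @ v
```

Because the recurrence is expressed as a masked matrix product rather than a step-by-step loop, the whole sequence is processed in parallel, which is what keeps the overhead of the non-verbal feature enhancer low relative to a conventional sequential LSTM.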