Multimodal Sentiment Analysis (MSA) integrates language, visual, and acoustic modalities to infer human sentiment. Most existing methods focus on either globally shared representations or modality-specific features, overlooking signals that are shared only by particular modality pairs; this limits the expressiveness and discriminative power of multimodal representations. To address this limitation, we propose a Tri-Subspace Disentanglement (TSD) framework that explicitly factorizes features into three complementary subspaces: a common subspace capturing global consistency, pairwise-shared subspaces modeling cross-modal synergies between specific modality pairs, and private subspaces preserving modality-specific cues. To keep these subspaces pure and mutually independent, we introduce a decoupling supervisor together with structured regularization losses. We further design a Subspace-Aware Cross-Attention (SACA) fusion module that adaptively models and integrates information from the three subspaces to obtain richer and more robust representations. Experiments on CMU-MOSI and CMU-MOSEI demonstrate that TSD achieves state-of-the-art performance across all key metrics, reaching an MAE of 0.691 on CMU-MOSI and a seven-class accuracy (ACC-7) of 54.9% on CMU-MOSEI, and that it transfers well to multimodal intent recognition. Ablation studies confirm that tri-subspace disentanglement and SACA jointly enhance the modeling of multi-granular cross-modal sentiment cues.
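For concreteness, the sketch below illustrates one way the tri-subspace factorization and a subspace-aware attention fusion could be realized in PyTorch. Everything here is an illustrative assumption on our part: the linear projectors, the `"lv"`/`"la"` pair naming, the orthogonality penalty used as a stand-in for the structured regularization, and the toy attention module standing in for SACA; the abstract does not specify the actual TSD or SACA designs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriSubspaceEncoder(nn.Module):
    """Projects one modality's features into common, pairwise-shared,
    and private subspaces (hypothetical linear projectors)."""
    def __init__(self, in_dim: int, sub_dim: int, pair_names=("lv", "la")):
        super().__init__()
        self.common = nn.Linear(in_dim, sub_dim)    # globally consistent cues
        self.private = nn.Linear(in_dim, sub_dim)   # modality-specific cues
        # One projector per modality pair this modality participates in,
        # e.g. "lv" = language-visual, "la" = language-acoustic (assumed naming).
        self.pairwise = nn.ModuleDict(
            {n: nn.Linear(in_dim, sub_dim) for n in pair_names}
        )

    def forward(self, x: torch.Tensor) -> dict:
        out = {"common": self.common(x), "private": self.private(x)}
        out.update({f"pair_{n}": proj(x) for n, proj in self.pairwise.items()})
        return out

def orthogonality_loss(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """One plausible 'structured regularization': a soft orthogonality penalty
    driving the cosine similarity between two subspace features to zero."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    return (a * b).sum(dim=-1).pow(2).mean()

class SubspaceAwareFusion(nn.Module):
    """Toy stand-in for SACA: self-attention over the subspace features,
    treated as a short token sequence, followed by mean pooling."""
    def __init__(self, sub_dim: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(sub_dim, n_heads, batch_first=True)

    def forward(self, feats: dict) -> torch.Tensor:
        tokens = torch.stack(list(feats.values()), dim=1)  # (batch, n_subspaces, dim)
        fused, _ = self.attn(tokens, tokens, tokens)
        return fused.mean(dim=1)

# Usage with dummy language features of shape (batch, in_dim)
enc = TriSubspaceEncoder(in_dim=768, sub_dim=128)
feats = enc(torch.randn(4, 768))
decouple = orthogonality_loss(feats["common"], feats["private"])
rep = SubspaceAwareFusion(sub_dim=128)(feats)  # (4, 128) fused representation
```

In a full system, a decoupling supervisor would presumably apply such penalties across all subspace pairs and all three modalities, and the fusion would attend across modalities as well as subspaces; the sketch only fixes the interface implied by the abstract.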