Multimodal sentiment analysis integrates linguistic, visual, and acoustic modalities. Mainstream approaches, whether based on modality-invariant and modality-specific factorization or on complex fusion, still model temporal and spatial information jointly. Such mixed modeling ignores spatiotemporal heterogeneity, which leads to spatiotemporal information asymmetry and limits performance. We therefore propose TSDA (Temporal-Spatial Decouple before Act), which explicitly decouples each modality into temporal dynamics and spatial structural context before any interaction. For every modality, a temporal encoder and a spatial encoder project the signal into separate temporal and spatial bodies. Factor-Consistent Cross-Modal Alignment then aligns temporal features only with their temporal counterparts across modalities, and spatial features only with their spatial counterparts. Factor-specific supervision and decorrelation regularization reduce cross-factor leakage while preserving complementarity. A Gated Recouple module then recouples the aligned streams for the task. Extensive experiments show that TSDA outperforms baselines. Ablation studies confirm the necessity and interpretability of the design.
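The pipeline described above (decouple, align within each factor, decorrelate, recouple) can be illustrated with a minimal NumPy sketch. The abstract does not specify the encoders, the loss forms, or the gate, so everything below is an assumption made for illustration: the stand-in encoders (`encode_temporal`, `encode_spatial`), the cosine-based alignment loss, the cross-correlation penalty, the sigmoid gate, and all dimensions are hypothetical and are not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_temporal(x, w):
    """Stand-in temporal encoder (assumption): project mean frame-to-frame change."""
    return np.diff(x, axis=1).mean(axis=1) @ w          # (B, d)

def encode_spatial(x, w):
    """Stand-in spatial encoder (assumption): project the time-averaged frame."""
    return x.mean(axis=1) @ w                           # (B, d)

def alignment_loss(feats):
    """Mean (1 - cosine similarity) over all modality pairs within ONE factor."""
    names = sorted(feats)
    total, pairs = 0.0, 0
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            a, b = feats[names[i]], feats[names[j]]
            norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
            cos = np.sum(a * b, axis=1) / norms
            total += float(np.mean(1.0 - cos))
            pairs += 1
    return total / pairs

def decorrelation_penalty(t, s):
    """Mean squared cross-correlation between the two factors of one modality."""
    t = (t - t.mean(axis=0)) / (t.std(axis=0) + 1e-8)
    s = (s - s.mean(axis=0)) / (s.std(axis=0) + 1e-8)
    corr = t.T @ s / t.shape[0]                         # (d, d)
    return float(np.mean(corr ** 2))

def gated_recouple(t, s, w_g, b_g):
    """Sigmoid gate (assumption) that mixes temporal and spatial streams per feature."""
    gate = 1.0 / (1.0 + np.exp(-(np.concatenate([t, s], axis=1) @ w_g + b_g)))
    return gate * t + (1.0 - gate) * s                  # (B, d)

# Toy inputs: batch B, sequence length T, shared factor width d (all hypothetical).
B, T, d = 8, 20, 16
dims = {"language": 32, "visual": 24, "acoustic": 12}
inputs = {m: rng.normal(size=(B, T, k)) for m, k in dims.items()}
w_t = {m: rng.normal(size=(k, d)) / np.sqrt(k) for m, k in dims.items()}
w_s = {m: rng.normal(size=(k, d)) / np.sqrt(k) for m, k in dims.items()}

# 1) Decouple every modality before any cross-modal interaction.
temporal = {m: encode_temporal(inputs[m], w_t[m]) for m in dims}
spatial = {m: encode_spatial(inputs[m], w_s[m]) for m in dims}

# 2) Factor-consistent alignment: temporal with temporal, spatial with spatial.
loss_align = alignment_loss(temporal) + alignment_loss(spatial)

# 3) Decorrelation between the two factors of each modality.
loss_decor = sum(decorrelation_penalty(temporal[m], spatial[m]) for m in dims) / len(dims)

# 4) Recouple the pooled streams into one task representation.
t_all = np.mean([temporal[m] for m in dims], axis=0)
s_all = np.mean([spatial[m] for m in dims], axis=0)
w_g, b_g = rng.normal(size=(2 * d, d)) * 0.1, np.zeros(d)
fused = gated_recouple(t_all, s_all, w_g, b_g)          # (B, d)
```

The point of the sketch is structural: `alignment_loss` is only ever called on a dictionary of same-factor features, so temporal and spatial features never meet until `gated_recouple`. In a trained model the random projections would be replaced by learned encoders and the two losses added to the task objective.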