Multimodal sentiment analysis (MSA) is widely applied, but modalities are often missing in real-world environments, so researchers must make models more robust, which usually takes significant effort. Multimodal neural architecture search (MNAS) is a more efficient approach. However, current MNAS methods, while effective at integrating multi-level information, cannot simultaneously search for the optimal operations for extracting modality-specific information, which weakens model robustness across diverse scenarios. These methods also do little to improve the capture of emotional cues. In this paper, we propose the robust-sentiment multimodal neural architecture search (RMNAS) framework. Specifically, we use the Transformer as a unified architecture for all modalities and add a search over token mixers to strengthen the encoding capacity of each modality and improve robustness across diverse scenarios. We then use BM-NAS to integrate multi-level information. Furthermore, we incorporate local sentiment variation trends to guide the token mixers' computation, enhancing the model's ability to capture sentiment context. Experimental results show that our approach outperforms or is competitive with existing state-of-the-art approaches to incomplete multimodal learning on both sentence-level and dialogue-level MSA tasks, without requiring prior knowledge of incomplete learning.
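The idea of searching over token mixers can be illustrated with a minimal sketch. The abstract does not specify the candidate set or the search strategy, so everything below is an assumption made for illustration: the three candidate mixers (`identity_mixer`, `local_avg_mixer`, `global_avg_mixer`), the softmax relaxation over architecture logits in the style of differentiable architecture search, and all function names are hypothetical and are not taken from RMNAS. The guidance from local sentiment variation trends is not modeled here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate token mixers (assumed, not from the paper).
# Each maps a sequence of token vectors to a sequence of the same shape.
def identity_mixer(tokens):
    """No mixing: every token keeps its own features."""
    return [list(t) for t in tokens]


def local_avg_mixer(tokens, window=3):
    """Average each token with its neighbours inside a centred window."""
    half = window // 2
    n, dim = len(tokens), len(tokens[0])
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        out.append(
            [sum(tokens[j][d] for j in range(lo, hi)) / (hi - lo) for d in range(dim)]
        )
    return out


def global_avg_mixer(tokens):
    """Replace every token with the mean over the whole sequence."""
    n, dim = len(tokens), len(tokens[0])
    mean = [sum(t[d] for t in tokens) / n for d in range(dim)]
    return [list(mean) for _ in tokens]


CANDIDATES = {
    "identity": identity_mixer,
    "local_avg": local_avg_mixer,
    "global_avg": global_avg_mixer,
}


def mixed_token_mixer(tokens, arch_logits):
    """Continuous relaxation: softmax-weighted sum of all candidate outputs."""
    weights = softmax(arch_logits)
    outputs = [mixer(tokens) for mixer in CANDIDATES.values()]
    n, dim = len(tokens), len(tokens[0])
    return [
        [sum(w * out[i][d] for w, out in zip(weights, outputs)) for d in range(dim)]
        for i in range(n)
    ]


def select_mixer(arch_logits):
    """Discretise after search: keep the candidate with the largest logit."""
    names = list(CANDIDATES)
    best = max(range(len(names)), key=lambda k: arch_logits[k])
    return names[best]
```

In a real search the architecture logits would be learned jointly with the network weights, one set of logits per modality and layer, and `select_mixer` would then fix the final operation for each position. This sketch only shows the forward pass of the relaxed mixer and the final selection step.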