Multimodal sentiment analysis is an important research area for understanding a user's internal states. Deep learning methods have proven effective, but their poor interpretability has drawn increasing attention. Previous work has attempted to provide interpretability through attention weights or vector distributions. However, these explanations are not intuitive and can vary with the trained model. This study proposes a novel approach that provides interpretability by converting nonverbal modalities into text descriptions and using large-scale language models for sentiment prediction. Because every modality reaches the model as text, one can directly inspect which parts of the input the model relies on when making a decision, which significantly improves interpretability. Specifically, we generate descriptions from two feature patterns for the audio modality and from discrete action units for the facial modality. Experimental results on two sentiment analysis tasks show that the proposed approach maintains or even improves effectiveness compared with baselines that use conventional features, with the largest improvement being 2.49% in F1 score. The results also show that multimodal descriptions behave similarly to conventional fusion methods when modalities are fused. Overall, the results demonstrate that the proposed approach is both interpretable and effective for multimodal sentiment analysis.
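As a rough illustration of the idea of turning discrete facial action units into text that a language model can read, the following minimal sketch may help. It is not the authors' implementation: the phrase table, the sentence templates, the prompt layout, and the function names `describe_face` and `build_prompt` are all assumptions made for this example. The audio description is passed in as a ready-made string because the abstract does not specify the two audio feature patterns.

```python
# Hypothetical phrase table: FACS action unit (AU) codes mapped to plain-language
# phrases. The wording is an assumption for illustration, not taken from the paper.
AU_PHRASES = {
    "AU01": "raises the inner brows",
    "AU04": "lowers the brows",
    "AU06": "raises the cheeks",
    "AU12": "pulls the lip corners up",
    "AU15": "pulls the lip corners down",
}


def describe_face(active_aus):
    """Turn a list of active AU codes into one sentence; unknown codes are skipped."""
    phrases = [AU_PHRASES[au] for au in active_aus if au in AU_PHRASES]
    if not phrases:
        return "The speaker shows a neutral face."
    return "The speaker " + ", ".join(phrases) + "."


def build_prompt(transcript, active_aus, audio_description=""):
    """Assemble a text-only prompt from the transcript and modality descriptions.

    The prompt layout is a hypothetical template; any language model that
    accepts plain text could consume the returned string.
    """
    parts = [
        f'Transcript: "{transcript}"',
        f"Face: {describe_face(active_aus)}",
    ]
    if audio_description:
        parts.append(f"Voice: {audio_description}")
    parts.append("Question: is the speaker's sentiment positive or negative?")
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_prompt("I guess that went fine.", ["AU04", "AU15"]))
```

Because the facial cues appear in the prompt as ordinary sentences, one can read the input directly and see which described cues are available to the model, which is the kind of inspection the abstract refers to.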