Multi-modal aspect-based sentiment classification (MABSC) is the task of classifying the sentiment of a target entity mentioned in a sentence and an image. However, previous methods fail to account for the fine-grained semantic association between the image and the text, which limits their ability to identify fine-grained image aspects and opinions. To address these limitations, we propose a new approach called SeqCSG, which enhances the encoder-decoder sentiment classification framework with sequential cross-modal semantic graphs. SeqCSG uses image captions and scene graphs to extract both global and local fine-grained image information, and treats them, along with the tokens of the tweet, as elements of the cross-modal semantic graph. The sequential cross-modal semantic graph is represented as a sequence together with a multi-modal adjacency matrix that indicates the relationships between elements. Experimental results show that the approach outperforms existing methods and achieves state-of-the-art performance on two standard datasets. Further analysis demonstrates that the model can implicitly learn the correlation between fine-grained information in the image and in the text for the given target. Our code is available at https://github.com/zjukg/SeqCSG.
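To make the idea of a graph "represented as a sequence with a multi-modal adjacency matrix" concrete, the following is a minimal sketch rather than the SeqCSG implementation. The function name `build_sequence_and_adjacency`, the example inputs, and all three connectivity rules are assumptions chosen for illustration; the abstract does not specify which elements are connected.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object) from a scene graph


def build_sequence_and_adjacency(
    tweet_tokens: List[str],
    caption_tokens: List[str],
    triples: List[Triple],
) -> Tuple[List[str], List[List[int]]]:
    """Flatten tweet, caption, and scene-graph elements into one sequence
    and build a 0/1 adjacency matrix over that sequence.

    The connectivity rules are hypothetical, for illustration only.
    """
    sequence: List[str] = []
    segment: List[str] = []  # which source each element came from

    for tok in tweet_tokens:
        sequence.append(tok)
        segment.append("tweet")
    for tok in caption_tokens:
        sequence.append(tok)
        segment.append("caption")

    triple_starts: List[int] = []
    for subj, rel, obj in triples:
        triple_starts.append(len(sequence))
        sequence.extend([subj, rel, obj])
        segment.extend(["graph"] * 3)

    n = len(sequence)
    adj = [[0] * n for _ in range(n)]

    # Every element is connected to itself.
    for i in range(n):
        adj[i][i] = 1

    # Assumed rule 1: tokens of the same text segment are fully connected.
    for i in range(n):
        for j in range(n):
            if segment[i] == segment[j] and segment[i] in ("tweet", "caption"):
                adj[i][j] = 1

    # Assumed rule 2: inside a triple, only subject-relation and
    # relation-object are linked, which preserves the graph structure.
    for start in triple_starts:
        s, r, o = start, start + 1, start + 2
        adj[s][r] = adj[r][s] = 1
        adj[r][o] = adj[o][r] = 1

    # Assumed rule 3: tweet tokens are linked to every element, so the text
    # can reach both global (caption) and local (scene graph) image cues.
    for i in range(n):
        if segment[i] == "tweet":
            for j in range(n):
                adj[i][j] = adj[j][i] = 1

    return sequence, adj


if __name__ == "__main__":
    seq, adj = build_sequence_and_adjacency(
        ["Messi", "wins"],
        ["a", "man"],
        [("man", "holding", "trophy")],
    )
    print(seq)
    for row in adj:
        print(row)
```

A matrix of this kind could serve as an attention mask in an encoder, so that each element only attends to its graph neighbours; whether and how SeqCSG does this is documented in the linked repository, not here.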