Multimodal aspect-based sentiment analysis (MABSA) aims to understand opinions at a fine-grained, aspect level, advancing human-computer interaction and other fields. Traditionally, MABSA methods use a joint prediction approach that identifies aspects and their sentiments simultaneously. However, we argue that joint models are not always superior. Our analysis shows that joint models struggle to align the relevant text tokens with image patches, which leads to misalignment and ineffective use of the image. In contrast, a pipeline framework first identifies aspects through Multimodal Aspect Term Extraction (MATE) and then aligns these aspects with image patches for Multimodal Aspect-Oriented Sentiment Classification (MASC). This approach is better suited to multimodal scenarios in which effective image use is crucial. We present three key observations: (a) MATE and MASC have different feature requirements, with MATE relying on token-level features and MASC on sequence-level features; (b) the aspect identified by MATE is crucial for effective image utilization; and (c) images play a trivial role in previous MABSA methods because of their high noise. Based on these observations, we propose a pipeline framework that first predicts the aspect and then uses translation-based alignment (TBA) to enhance multimodal semantic consistency for better image utilization. Our method achieves state-of-the-art (SOTA) performance on the widely used MABSA datasets Twitter-15 and Twitter-17. This demonstrates the effectiveness of the pipeline approach and its potential to provide valuable insights for future MABSA research. For reproducibility, the code and checkpoint will be released.
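The two-stage control flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `decode_bio` and `run_pipeline`, the BIO tagging scheme, and the `tagger` and `classifier` callables are all assumptions introduced here for illustration, and the translation-based alignment step is not modeled (it would sit inside the hypothetical `classifier`). The sketch only shows that the first stage makes a decision per token, while the second stage makes one decision per extracted aspect over the whole sequence.

```python
from typing import Any, Callable, List, Tuple

Span = Tuple[int, int]  # half-open [start, end) token indices


def decode_bio(tags: List[str]) -> List[Span]:
    """Turn token-level B/I/O tags into aspect spans (assumed tagging scheme)."""
    spans: List[Span] = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:      # close the previous aspect
                spans.append((start, i))
            start = i
        elif tag == "I":
            if start is None:          # tolerate an I with no preceding B
                start = i
        else:                          # "O" ends any open aspect
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:              # aspect running to the end of the text
        spans.append((start, len(tags)))
    return spans


def run_pipeline(
    tokens: List[str],
    image: Any,
    tagger: Callable[[List[str], Any], List[str]],
    classifier: Callable[[List[str], str, Any], str],
) -> List[Tuple[str, str]]:
    """Stage 1 (MATE-like): tag tokens. Stage 2 (MASC-like): classify each aspect."""
    tags = tagger(tokens, image)                    # one label per token
    results = []
    for start, end in decode_bio(tags):
        aspect = " ".join(tokens[start:end])
        sentiment = classifier(tokens, aspect, image)  # one label per aspect
        results.append((aspect, sentiment))
    return results


if __name__ == "__main__":
    # Toy stand-ins for trained models; purely hypothetical.
    toy_tokens = ["Taylor", "Swift", "rocked", "Madison", "Square", "Garden"]
    toy_tagger = lambda toks, img: ["B", "I", "O", "B", "I", "I"]
    toy_classifier = lambda toks, asp, img: (
        "positive" if asp == "Taylor Swift" else "neutral"
    )
    print(run_pipeline(toy_tokens, None, toy_tagger, toy_classifier))
```

Because the second stage receives the extracted aspect explicitly, any image-text alignment performed there can be conditioned on that aspect, which is the property observation (b) relies on.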