The emergence of multimodal data on social media platforms presents new opportunities to better understand user sentiment toward a given aspect. However, existing multimodal datasets for Aspect-Category Sentiment Analysis (ACSA) often focus on textual annotations and neglect fine-grained information in images. Consequently, these datasets fail to fully exploit the richness inherent in multimodal data. To address this, we introduce a new Vietnamese multimodal dataset, named ViMACSA, which consists of 4,876 text-image pairs with 14,618 fine-grained annotations for both text and image in the hotel domain. Additionally, we propose a Fine-Grained Cross-Modal Fusion Framework (FCMF) that effectively learns both intra- and inter-modality interactions and then fuses this information to produce a unified multimodal representation. Experimental results show that our framework outperforms state-of-the-art (SOTA) models on the ViMACSA dataset, achieving the highest F1 score of 79.73%. We also explore characteristics and challenges in Vietnamese multimodal sentiment analysis, including misspellings, abbreviations, and the complexities of the Vietnamese language. This work contributes both a benchmark dataset and a new framework that leverages fine-grained multimodal information to improve multimodal aspect-category sentiment analysis. Our dataset is available for research purposes: https://github.com/hoangquy18/Multimodal-Aspect-Category-Sentiment-Analysis.