Multimodal Sarcasm Understanding (MSU) has a wide range of applications in the news domain, such as public opinion analysis and forgery detection. However, existing MSU benchmarks and approaches usually focus on sentence-level MSU. In document-level news, sarcasm clues are sparse or small and are often concealed in long text. Moreover, whereas sentence-level comments such as tweets mainly focus on a few trends or hot topics (e.g., sports events), news content is considerably more diverse. Models designed for sentence-level MSU may therefore fail to capture sarcasm clues in document-level news. To fill this gap, we present a comprehensive benchmark for Document-level Multimodal Sarcasm Understanding (DocMSU). Our dataset contains 102,588 news items with text-image pairs, covering 9 diverse topics such as health and business. The scale and topical diversity of DocMSU facilitate research on document-level MSU in real-world scenarios. To address the new challenges posed by DocMSU, we introduce a fine-grained sarcasm comprehension method that aligns pixel-level image features with word-level textual features in documents. Experiments demonstrate the effectiveness of our method, showing that it can serve as a baseline approach for the challenging DocMSU benchmark. Our code and dataset are available at https://github.com/Dulpy/DocMSU.