Multimodal Sarcasm Explanation (MuSE) is a new and challenging task that aims to generate a natural language sentence explaining why a multimodal social post (an image together with its caption) contains sarcasm. Although the existing pioneering study has achieved great success with a BART backbone, it overlooks the gap between the visual feature space and the decoder semantic space, the object-level metadata of the image, and potentially useful external knowledge. To address these limitations, we propose a novel mulTi-source sEmantic grAph-based Multimodal sarcasm explanation scheme, named TEAM. In particular, TEAM extracts object-level semantic metadata from the input image instead of the traditional global visual features. Meanwhile, TEAM resorts to ConceptNet to obtain external knowledge concepts related to the input text and the extracted object metadata. Thereafter, TEAM introduces a multi-source semantic graph that comprehensively characterizes the semantic relations among the multiple sources (i.e., caption, object metadata, and external knowledge) to facilitate sarcasm reasoning. Extensive experiments on the publicly released dataset MORE verify the superiority of our model over cutting-edge methods.
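To make the idea of a multi-source semantic graph more concrete, the following is a minimal sketch and not the authors' implementation. It assumes three node sources (caption tokens, object labels, knowledge concepts) and three illustrative edge rules: neighbouring caption tokens are linked, an object label is linked to an identical caption token, and each token or label is linked to its retrieved concept. The names `SemanticGraph`, `build_graph`, and `TOY_KNOWLEDGE` are hypothetical, and the small dictionary merely stands in for a real ConceptNet lookup.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for a ConceptNet lookup (assumption: one related
# concept per word). A real system would query ConceptNet itself.
TOY_KNOWLEDGE = {
    "sunny": "warm",
    "umbrella": "rain",
    "puddle": "wet",
}


@dataclass
class SemanticGraph:
    nodes: list = field(default_factory=list)  # (source, text) pairs
    edges: set = field(default_factory=set)    # undirected, stored as (i, j) with i < j

    def add_node(self, source, text):
        """Append a node and return its index."""
        self.nodes.append((source, text))
        return len(self.nodes) - 1

    def add_edge(self, i, j):
        """Add an undirected edge between two distinct nodes."""
        if i != j:
            self.edges.add((min(i, j), max(i, j)))

    def adjacency(self):
        """Return a symmetric 0/1 adjacency matrix with self-loops."""
        n = len(self.nodes)
        matrix = [[1 if r == c else 0 for c in range(n)] for r in range(n)]
        for i, j in self.edges:
            matrix[i][j] = matrix[j][i] = 1
        return matrix


def build_graph(caption_tokens, object_labels, knowledge=TOY_KNOWLEDGE):
    graph = SemanticGraph()

    # Source 1: caption tokens. Assumption: neighbouring tokens are linked.
    caption_ids = [graph.add_node("caption", t) for t in caption_tokens]
    for a, b in zip(caption_ids, caption_ids[1:]):
        graph.add_edge(a, b)

    # Source 2: object-level metadata. Assumption: an object label is
    # linked to any caption token with the same surface form.
    object_ids = [graph.add_node("object", o) for o in object_labels]
    for o_id, label in zip(object_ids, object_labels):
        for c_id, token in zip(caption_ids, caption_tokens):
            if label.lower() == token.lower():
                graph.add_edge(o_id, c_id)

    # Source 3: external knowledge. Each token or label with a known
    # concept gets a new knowledge node attached to it.
    sources = list(zip(caption_ids, caption_tokens)) + list(zip(object_ids, object_labels))
    for idx, word in sources:
        concept = knowledge.get(word.lower())
        if concept is not None:
            graph.add_edge(idx, graph.add_node("knowledge", concept))

    return graph


if __name__ == "__main__":
    g = build_graph(["what", "a", "sunny", "day"], ["umbrella", "puddle"])
    for row in g.adjacency():
        print(row)
```

In a setup like this, the adjacency matrix could serve as the structural input to a graph neural layer, while the node texts would be embedded by the language model; how TEAM actually defines its edges and encodes the graph is described in the paper itself, not here.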