Sarcasm Explanation in Dialogue (SED) is a new and challenging task that aims to generate a natural language explanation for a given sarcastic dialogue involving multiple modalities (i.e., utterance, video, and audio). Although existing studies built on the generative pretrained language model BART have achieved great success, they overlook the sentiments residing in the utterance, video, and audio, which are vital clues for sarcasm explanation. Incorporating sentiments to boost SED performance is non-trivial, however, owing to three main challenges: 1) the diverse effects of utterance tokens on sentiments; 2) the gap between video-audio sentiment signals and the embedding space of BART; and 3) the various relations among utterances, utterance sentiments, and video-audio sentiments. To tackle these challenges, we propose a novel sEntiment-enhanceD Graph-based multimodal sarcasm Explanation framework, named EDGE. In particular, we first propose a lexicon-guided utterance sentiment inference module, in which a heuristic utterance sentiment refinement strategy is devised. We then develop a module named Joint Cross Attention-based Sentiment Inference (JCA-SI) by extending the multimodal sentiment analysis model JCA to derive the joint sentiment label for each video-audio clip. Thereafter, we devise a context-sentiment graph to comprehensively model the semantic relations among the utterances, utterance sentiments, and video-audio sentiments, thereby facilitating sarcasm explanation generation. Extensive experiments on the publicly released dataset WITS verify the superiority of our model over cutting-edge methods.
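To make the idea of a context-sentiment graph more concrete, the following is a minimal sketch rather than the EDGE implementation. The abstract does not specify the node granularity or the edge types, so everything here is an assumption made for illustration: the `Utterance` record, the sentiment label strings, the function name `build_context_sentiment_graph`, and the edge scheme (each utterance linked to its own utterance sentiment and video-audio sentiment, the two sentiments linked to each other, and consecutive utterances linked as context).

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """Hypothetical record for one dialogue turn (not from the paper)."""
    text: str
    utt_sentiment: str  # assumed output of a lexicon-guided inference step
    va_sentiment: str   # assumed output of a video-audio sentiment model


def build_context_sentiment_graph(utterances):
    """Build a toy graph as (nodes, edges).

    Nodes are (kind, turn_index, value) tuples; edges are undirected
    pairs of node indices stored as (smaller, larger).
    The edge scheme is an illustrative assumption, not the paper's design.
    """
    nodes = []
    edges = set()
    utterance_ids = []

    for turn, utt in enumerate(utterances):
        utt_id = len(nodes)
        nodes.append(("utterance", turn, utt.text))
        us_id = len(nodes)
        nodes.append(("utt_sentiment", turn, utt.utt_sentiment))
        va_id = len(nodes)
        nodes.append(("va_sentiment", turn, utt.va_sentiment))

        # Link each utterance to both of its sentiment nodes,
        # and link the two sentiment nodes to each other.
        edges.add((utt_id, us_id))
        edges.add((utt_id, va_id))
        edges.add((us_id, va_id))
        utterance_ids.append(utt_id)

    # Link consecutive utterances to capture dialogue context.
    for prev_id, next_id in zip(utterance_ids, utterance_ids[1:]):
        edges.add((prev_id, next_id))

    return nodes, sorted(edges)


if __name__ == "__main__":
    dialogue = [
        Utterance("Great, another meeting.", "positive", "negative"),
        Utterance("You love meetings.", "positive", "neutral"),
    ]
    nodes, edges = build_context_sentiment_graph(dialogue)
    for node in nodes:
        print(node)
    print(edges)
```

In this toy dialogue, the first utterance carries a positive lexical sentiment but a negative video-audio sentiment; placing both as neighbours of the same utterance node is one simple way such a mismatch could be exposed to a downstream graph encoder.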