The ever-evolving social media discourse has witnessed an overwhelming use of memes to express opinions or dissent. Besides being misused to spread discontent, memes are mined by corporations and political parties to gauge public opinion. Memes thus offer affect-enriched insights into the societal psyche. However, current approaches have yet to effectively model the affective dimensions expressed in memes: they rely extensively on large multimodal datasets for pre-training and generalize poorly due to constrained visual-linguistic grounding. In this paper, we introduce MOOD (Meme emOtiOns Dataset), which embodies six basic emotions. We then present ALFRED (emotion-Aware muLtimodal Fusion foR Emotion Detection), a novel multimodal neural framework that (i) explicitly models emotion-enriched visual cues and (ii) employs efficient cross-modal fusion via a gating mechanism. Our investigation shows that ALFRED outperforms existing baselines by 4.94% F1. Additionally, ALFRED competes strongly with the previous best approaches on the challenging Memotion task. We then discuss ALFRED's domain-agnostic generalizability by demonstrating that it outperforms other baselines on two recently released datasets, HarMeme and Dank Memes. Further, we analyze ALFRED's interpretability using attention maps. Finally, we highlight the inherent challenges that the complex interplay of disparate modality-specific cues poses for meme analysis.
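The abstract names a gating mechanism for cross-modal fusion without detailing it. The sketch below illustrates one common form of gated fusion in PyTorch; the projection layers, feature dimensions, and the convex-combination gate are illustrative assumptions, not ALFRED's actual architecture.

```python
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    """Minimal sketch of gated fusion of text and visual features.

    A sigmoid gate computed from both modalities decides, per
    dimension, how much of each modality's signal to pass on.
    The exact gating form used by ALFRED is not specified in the
    abstract; this is one standard instantiation.
    """

    def __init__(self, text_dim: int, vis_dim: int, hidden_dim: int):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, text_feat: torch.Tensor, vis_feat: torch.Tensor) -> torch.Tensor:
        t = torch.tanh(self.text_proj(text_feat))  # (batch, hidden_dim)
        v = torch.tanh(self.vis_proj(vis_feat))    # (batch, hidden_dim)
        # Per-dimension gate in (0, 1), conditioned on both modalities.
        g = torch.sigmoid(self.gate(torch.cat([t, v], dim=-1)))
        return g * t + (1.0 - g) * v               # gated convex combination

# Usage: fuse a 768-d text embedding with a 2048-d visual embedding
# (dimensions chosen arbitrarily for illustration).
fusion = GatedCrossModalFusion(text_dim=768, vis_dim=2048, hidden_dim=512)
fused = fusion(torch.randn(4, 768), torch.randn(4, 2048))
print(fused.shape)  # torch.Size([4, 512])
```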