RCLMuFN: Relational Context Learning and Multiplex Fusion Network for Multimodal Sarcasm Detection

Sarcasm typically conveys contempt or criticism by expressing a meaning contrary to the speaker's true intent. Accurate sarcasm detection helps identify and filter undesirable information on the Internet, thereby reducing malicious defamation and rumor-mongering. Nonetheless, automatic sarcasm detection remains highly challenging for machines, because it depends critically on intricate factors such as relational context. Most existing multimodal sarcasm detection methods introduce graph structures to establish entity relationships between text and images, but they neglect to learn the relational context between text and images, which is crucial evidence for understanding the meaning of sarcasm. In addition, the meaning of sarcasm shifts as the context changes, and existing methods may not model such dynamic changes accurately, which limits their generalization ability. To address these issues, we propose a relational context learning and multiplex fusion network (RCLMuFN) for multimodal sarcasm detection. First, we employ four feature extractors to comprehensively extract features from raw text and images, aiming to uncover latent features that may previously have been overlooked. Second, we use a relational context learning module to learn the contextual information of text and images and to capture its dynamic properties through shallow and deep interactions. Finally, we employ a multiplex feature fusion module that improves the model's generalization by deeply integrating multimodal features derived from the various interaction contexts. Extensive experiments on two multimodal sarcasm detection datasets show that the proposed method achieves state-of-the-art performance.

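The three stages named in the abstract (feature extraction, relational context learning through shallow and deep interactions, and multiplex fusion) can be illustrated with a toy sketch. This is not the RCLMuFN implementation: the abstract does not specify the extractors or the interaction and fusion operators, so everything below is an assumption made for illustration. Random vectors stand in for the outputs of the four extractors (assumed here to be two per modality), the shallow interaction is taken to be an element-wise product of pooled features, the deep interaction is taken to be scaled dot-product cross-attention, and the fusion is a softmax-weighted sum. All function names are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention: each query attends over the other modality."""
    scale = math.sqrt(len(queries[0]))
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        attended.append([sum(w * kv[d] for w, kv in zip(weights, keys_values))
                         for d in range(len(q))])
    return attended


def shallow_interaction(text_feats, image_feats):
    """Assumed shallow interaction: element-wise product of pooled features."""
    t, v = mean_pool(text_feats), mean_pool(image_feats)
    return [a * b for a, b in zip(t, v)]


def deep_interaction(text_feats, image_feats):
    """Assumed deep interaction: text tokens attend over image patches, then pool."""
    return mean_pool(cross_attend(text_feats, image_feats))


def fuse(contexts, gate_logits):
    """Assumed fusion: softmax-weighted sum of the interaction contexts."""
    gates = softmax(gate_logits)
    return [sum(g * c[d] for g, c in zip(gates, contexts))
            for d in range(len(contexts[0]))]


def sarcasm_probability(fused, weights, bias=0.0):
    """Logistic classifier head over the fused vector."""
    return 1.0 / (1.0 + math.exp(-(dot(fused, weights) + bias)))


random.seed(0)
DIM = 8  # hypothetical shared feature dimension


def fake_features(n_items):
    """Random stand-in for one extractor's output (n_items vectors of size DIM)."""
    return [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(n_items)]


# Four stand-in extractors: two text views (5 tokens) and two image views (9 patches).
text_views = [fake_features(5), fake_features(5)]
image_views = [fake_features(9), fake_features(9)]

# One shallow and one deep context for every text/image pairing.
contexts = []
for t in text_views:
    for v in image_views:
        contexts.append(shallow_interaction(t, v))
        contexts.append(deep_interaction(t, v))

gate_logits = [0.0] * len(contexts)  # untrained gates, so all contexts weigh equally
classifier_weights = [random.uniform(-1, 1) for _ in range(DIM)]
prob = sarcasm_probability(fuse(contexts, gate_logits), classifier_weights)
print(f"{len(contexts)} interaction contexts, sarcasm probability {prob:.3f}")
```

In a trained model the gate logits and classifier weights would be learned and the random features would be replaced by real encoder outputs. The sketch only shows how several interaction contexts, computed at different depths, can be combined into a single prediction.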