Detecting sarcasm effectively requires a nuanced understanding of context, including vocal tone and facial expressions. Progress toward multimodal computational methods for sarcasm detection, however, is hampered by the scarcity of multimodal data. To address this, we present AMuSeD (Attentive deep neural network for MUltimodal Sarcasm dEtection incorporating bi-modal Data augmentation). This approach builds on the Multimodal Sarcasm Detection Dataset (MUStARD) and introduces a two-phase bimodal data augmentation strategy. The first phase generates varied text samples through back translation via several secondary languages. The second phase refines a FastSpeech 2-based speech synthesis system, fine-tuned specifically on sarcastic speech so that sarcastic intonation is retained. Together with a cloud-based Text-to-Speech (TTS) service, this fine-tuned FastSpeech 2 system produces the corresponding audio for the text augmentations. We also investigate various attention mechanisms for effectively merging text and audio features, finding self-attention to be the most efficient for bimodal integration. Our experiments show that this combined augmentation-and-attention approach achieves an F1-score of 81.0% on the text-audio modalities, surpassing even models that use all three modalities of the MUStARD dataset.
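The back-translation phase can be illustrated with a short sketch. This is a minimal example under stated assumptions, not the authors' pipeline: it uses the open Helsinki-NLP MarianMT checkpoints on Hugging Face as a stand-in translation engine and German as one pivot language, whereas the paper draws on several secondary languages and does not prescribe a specific translation service.

```python
# Minimal back-translation sketch. Assumption: MarianMT checkpoints stand in
# for whatever translation engine the paper actually used.
from transformers import MarianMTModel, MarianTokenizer

def back_translate(sentences, pivot="de"):
    """Translate English -> pivot language -> English to create paraphrases."""
    fwd = f"Helsinki-NLP/opus-mt-en-{pivot}"   # English -> pivot
    bwd = f"Helsinki-NLP/opus-mt-{pivot}-en"   # pivot -> English

    def translate(texts, model_name):
        tok = MarianTokenizer.from_pretrained(model_name)
        model = MarianMTModel.from_pretrained(model_name)
        batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
        out = model.generate(**batch)
        return tok.batch_decode(out, skip_special_tokens=True)

    return translate(translate(sentences, fwd), bwd)

# Each pivot language yields one paraphrased copy of a training utterance.
augmented = back_translate(["Oh great, another meeting. Just what I needed."])
print(augmented)
```

Running the same utterance through multiple pivot languages multiplies the text side of the training set while preserving the sarcastic content of each sentence.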
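The self-attention fusion the abstract refers to can likewise be sketched in a few lines. This is an illustrative PyTorch module, not the AMuSeD architecture: the embedding dimensions, the linear projections, and the two-token modality sequence are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

class BimodalSelfAttentionFusion(nn.Module):
    """Illustrative text-audio fusion via self-attention (dimensions assumed)."""

    def __init__(self, text_dim=768, audio_dim=128, d_model=256,
                 n_heads=4, n_classes=2):
        super().__init__()
        # Project each modality into a shared space.
        self.text_proj = nn.Linear(text_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, text_emb, audio_emb):
        # Treat the two projected modality vectors as a 2-token sequence.
        tokens = torch.stack(
            [self.text_proj(text_emb), self.audio_proj(audio_emb)], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)  # query = key = value
        return self.classifier(fused.mean(dim=1))     # pool and classify

# Example with random features: a batch of 8 utterances.
model = BimodalSelfAttentionFusion()
logits = model(torch.randn(8, 768), torch.randn(8, 128))
print(logits.shape)  # torch.Size([8, 2])
```

Letting each modality token attend to both itself and the other modality is one simple way to realize the bimodal integration the abstract describes; the cross-attention variants the paper also investigates would instead use one modality as the query and the other as key and value.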