Sarcasm is a complex form of figurative language in which the intended meaning contradicts the literal one. Its prevalence in social media and popular culture poses persistent challenges for natural language understanding, sentiment analysis, and content moderation. With the emergence of multimodal large language models, sarcasm detection extends beyond text and requires integrating cues from audio and vision. We present MuSaG, the first German multimodal sarcasm detection dataset, consisting of 33 minutes of manually selected and human-annotated statements from German television shows. Each instance provides aligned text, audio, and video modalities, annotated separately by humans, enabling evaluation in unimodal and multimodal settings. We benchmark nine open-source and commercial models, spanning text, audio, vision, and multimodal architectures, and compare their performance to human annotations. Our results show that while humans rely heavily on audio in conversational settings, models perform best on text. This highlights a gap in current multimodal models and motivates the use of MuSaG for developing models better suited to realistic scenarios. We release MuSaG publicly to support future research on multimodal sarcasm detection and human-model alignment.