Large multimodal language models have recently demonstrated remarkable proficiency across a wide range of tasks. Yet these models still struggle to understand the nuances of human humor through juxtaposition, particularly when it involves the nonlinear narratives that underpin many jokes and humor cues. This paper investigates this challenge by focusing on comics with contradictory narratives, where each comic consists of two panels that together create a humorous contradiction. We introduce the YesBut benchmark, which comprises tasks of varying difficulty, ranging from literal content comprehension to deep narrative reasoning, aimed at assessing AI's capability to recognize and interpret these comics. Through extensive experiments and analysis of recent commercial and open-source large (vision) language models, we assess their ability to comprehend the complex interplay of narrative and humor inherent in these comics. Our results show that even state-of-the-art models still lag behind human performance on this task. Our findings offer insights into the current limitations of, and potential improvements for, AI in understanding human creative expression.