Multimodal punchlines, which involve humor or sarcasm conveyed through image-caption pairs, are a popular form of communication on online multimedia platforms. With the rapid development of multimodal large language models (MLLMs), it is essential to assess their ability to effectively comprehend these punchlines. However, existing benchmarks on punchline comprehension suffer from three major limitations: 1) language shortcuts that allow models to rely solely on text, 2) a lack of question diversity, and 3) a narrow focus on a specific domain of multimodal content (e.g., cartoons). To address these limitations, we introduce a multimodal \textbf{Punch}line comprehension \textbf{Bench}mark, named \textbf{PunchBench}, which is tailored for accurate and comprehensive evaluation of punchline comprehension. To enhance evaluation accuracy, we generate synonymous and antonymous captions by modifying the original captions, which mitigates the impact of shortcuts in the captions. To provide a comprehensive evaluation, PunchBench incorporates diverse question formats and image-caption pairs from various domains. On this basis, we conduct extensive evaluations and reveal a significant gap between state-of-the-art MLLMs and humans in punchline comprehension. To improve punchline comprehension, we propose a Simple-to-Complex Chain-of-Question (SC-CoQ) strategy, which enables models to incrementally address complicated questions by first mastering simple ones. SC-CoQ effectively enhances the performance of various MLLMs on PunchBench, surpassing in-context learning and chain-of-thought prompting.