Large neural networks can now generate jokes, but do they really "understand" humor? We challenge AI models with three tasks derived from the New Yorker Cartoon Caption Contest: matching a joke to a cartoon, identifying a winning caption, and explaining why a winning caption is funny. These tasks encapsulate progressively more sophisticated aspects of "understanding" a cartoon; key elements are the complex, often surprising relationships between images and captions and the frequent inclusion of indirect and playful allusions to human experience and culture. We investigate both multimodal and language-only models: the former are challenged with the cartoon images directly, while the latter are given multifaceted descriptions of the visual scene to simulate human-level visual understanding. We find that both types of models struggle at all three tasks. For example, our best multimodal models fall 30 accuracy points behind human performance on the matching task, and, even when provided ground-truth visual scene descriptors, human-authored explanations are preferred head-to-head over the best machine-authored ones (few-shot GPT-4) in more than 2/3 of cases. We release models, code, a leaderboard, and a corpus, which includes newly gathered annotations describing each image's locations and entities, what is unusual in the scene, and an explanation of the joke.