Surprising videos, such as funny clips, creative performances, and visual illusions, attract significant attention. Enjoyment of these videos is not simply a response to visual stimuli; rather, it hinges on the human capacity to understand (and appreciate) the commonsense violations they depict. We introduce FunQA, a challenging video question-answering (QA) dataset specifically designed to evaluate and enhance the depth of video reasoning based on counter-intuitive and fun videos. Unlike most video QA benchmarks, which focus on less surprising contexts such as cooking or instructional videos, FunQA covers three previously unexplored types of surprising videos: 1) HumorQA, 2) CreativeQA, and 3) MagicQA. For each subset, we establish rigorous QA tasks designed to assess a model's capability in counter-intuitive timestamp localization, detailed video description, and reasoning about counter-intuitiveness. We also pose higher-level tasks, such as assigning a fitting and vivid title to the video and scoring the video's creativity. In total, the FunQA benchmark consists of 312K free-text QA pairs derived from 4.3K video clips, spanning a total of 24 video hours. Moreover, we propose FunMentor, an agent designed for Vision-Language Models (VLMs) that uses multi-turn dialogues to enhance models' understanding of counter-intuitiveness. Extensive experiments with existing VLMs demonstrate the effectiveness of FunMentor and reveal significant performance gaps on the FunQA videos across spatial-temporal reasoning, visual-centered reasoning, and free-text generation.