ViMU: Benchmarking Video Metaphorical Understanding

Any new medium, once it emerges, is used for more than transmitting overt content. The information it carries typically operates on two levels: the content directly presented, and the subtext beneath it, that is, the implicit ideas and intentions the creator seeks to convey through the medium. Video is no exception: since video technologies became widely adopted, video has served not only as a powerful tool for recording and communicating visual information, but also as a vehicle for emotions, attitudes, and social meanings that are often difficult to articulate explicitly. The true meaning of many videos therefore does not reside solely in what is shown on screen; it is often embedded in context, style of expression, and the viewer's social experience. Some video subtext is humorous, while other subtext carries irony, mockery, or criticism, and these implicit meanings can be interpreted very differently across cultural backgrounds and social groups. However, most existing video understanding models still focus primarily on literal visual comprehension, such as recognizing objects, actions, or temporal relations, and lack a systematic ability to understand the metaphorical, ironic, and social meanings embedded in videos. To bridge this gap, we introduce ViMU, the first benchmark designed to systematically evaluate how well frontier models understand subtext in videos. ViMU assesses whether video understanding models can go beyond literal perception to infer implicit meaning while grounding their interpretations in multimodal evidence, using both open-ended and multiple-choice questions. Importantly, all questions are designed to be hint-free, ensuring that no key evidence is disclosed to models before answering.
