Internet audio-visual clips convey meaning through time-varying sound and motion, extending beyond what text alone can represent. To examine whether AI models can understand such signals in human cultural contexts, we introduce AVMeme Exam, a human-curated benchmark of over one thousand iconic Internet sounds and videos spanning speech, songs, music, and sound effects. Each meme is paired with a unique Q&A that assesses understanding at multiple levels, from surface content, to context and emotion, to usage and world knowledge, along with metadata such as original year, transcript, summary, and sensitivity. We systematically evaluate state-of-the-art multimodal large language models (MLLMs) alongside human participants on this benchmark. Our results reveal a consistent limitation: current models perform poorly on textless music and sound effects, and struggle to reason in context and in culture compared to surface content. These findings highlight a key gap in human-aligned multimodal intelligence and call for models that can perceive contextually and culturally beyond the surface of what they hear and see. Project page: avmemeexam.github.io/public