We introduce the Massive Video Embedding Benchmark (MVEB), a 23-task benchmark for video embeddings spanning classification, zero-shot classification, clustering, pair classification, retrieval, and video-centric question answering (QA). We evaluate 33 models and find that no single model dominates: embeddings based on multimodal large language models (MLLMs) lead on classification, clustering, pair classification, and QA; multimodal binding models lead on retrieval and zero-shot classification; and generative MLLMs without contrastive adaptation collapse on cross-modal tasks. Paired video-only vs. audio+video evaluations show that audio's contribution depends on dataset annotation provenance: audio helps when labels were produced from both modalities and hurts when they were produced from visuals alone, a six-point gap that is consistent across model families. MVEB is derived from MVEB+, a 184-task pool, and is designed to maintain task diversity while reducing evaluation cost. It integrates into the MTEB ecosystem for unified evaluation across text, image, audio, and video. We release MVEB and all 184 tasks, along with code and a leaderboard, at https://github.com/embeddings-benchmark/mteb.