We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages. We evaluate 50+ models and find that no single model dominates across all tasks: contrastive audio-text models excel at environmental sound classification (e.g., ESC50) but score near random on multilingual speech tasks (e.g., SIB-FLEURS), while speech-pretrained models show the opposite pattern. Clustering remains challenging for all models, with even the best-performing model achieving only modest results. More broadly, models that excel at acoustic understanding often perform poorly on linguistic tasks, and vice versa. We also show that the performance of audio encoders on MAEB correlates strongly with their performance when used as encoders in audio large language models. MAEB is derived from MAEB+, a collection of 98 tasks, and is designed to maintain task diversity while reducing evaluation cost; it integrates into the MTEB ecosystem for unified evaluation across text, image, and audio modalities. We release MAEB and all 98 tasks along with code and a leaderboard at https://github.com/embeddings-benchmark/mteb.
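For concreteness, the sketch below illustrates how an audio embedding model might be evaluated on MAEB tasks through the mteb package. It is a minimal, hypothetical example: the model identifier, task name, and benchmark registration are assumptions for illustration and are not confirmed by the abstract; consult the repository for the actual identifiers.

# Hypothetical usage sketch, assuming the mteb package's get_model/get_tasks/MTEB/run interface.
import mteb

# Load an audio-capable embedding model registered with mteb (identifier is illustrative).
model = mteb.get_model("laion/clap-htsat-unfused")

# Select individual tasks (task name assumed to match the paper's naming);
# a curated benchmark object for MAEB may also be exposed via mteb.get_benchmark.
tasks = mteb.get_tasks(tasks=["ESC50"])

evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/maeb")

# Each entry is a task result; print the task name and its main score.
for res in results:
    print(res.task_name, res.get_score())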