Audio is a critical component of multimodal perception, and any truly intelligent system must demonstrate a wide range of auditory capabilities. These capabilities include transcription, classification, retrieval, reasoning, segmentation, clustering, reranking, and reconstruction. Fundamentally, each task involves transforming a raw audio signal into a meaningful "embedding" (a single vector, a sequence of continuous or discrete representations, or another structured form), which then serves as the basis for generating the task's final response. To accelerate progress toward robust machine auditory intelligence, we present the Massive Sound Embedding Benchmark (MSEB): an extensible framework designed to evaluate the auditory components of any multimodal system. In its first release, MSEB offers a comprehensive suite of eight core tasks, with more planned for the future, supported by diverse datasets, including the new, large-scale Simple Voice Questions (SVQ) dataset. Our initial experiments establish clear performance headroom, highlighting the significant opportunity to improve real-world multimodal experiences where audio is a core signal. We encourage the research community to use MSEB to assess their algorithms and contribute to its growth. The library is publicly hosted on GitHub.
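To make the embedding-centric evaluation pattern concrete, the sketch below shows the pipeline the abstract describes (raw audio → embedding → task response) for one representative task, retrieval scored by recall@1. This is an illustrative minimal example only, not MSEB's actual API: the `embed` function is a trivial stand-in for any learned audio encoder, and the toy data is synthetic.

```python
import numpy as np

def embed(audio: np.ndarray) -> np.ndarray:
    """Placeholder encoder: any model mapping a waveform to a fixed vector.
    A trivial normalized magnitude spectrum stands in for a learned embedding."""
    spectrum = np.abs(np.fft.rfft(audio, n=256))
    return spectrum / (np.linalg.norm(spectrum) + 1e-9)

def recall_at_1(query_embs: np.ndarray, doc_embs: np.ndarray, gold_ids) -> float:
    """Fraction of queries whose nearest document (cosine) is the gold one.
    Embeddings are unit-norm, so the dot product is cosine similarity."""
    sims = query_embs @ doc_embs.T
    preds = sims.argmax(axis=1)
    return float((preds == np.asarray(gold_ids)).mean())

# Toy corpus: three "document" waveforms and queries that are noisy copies.
rng = np.random.default_rng(0)
docs = [rng.standard_normal(1600) for _ in range(3)]
queries = [d + 0.01 * rng.standard_normal(1600) for d in docs]

D = np.stack([embed(d) for d in docs])      # index-side embeddings
Q = np.stack([embed(q) for q in queries])   # query-side embeddings
print(recall_at_1(Q, D, gold_ids=[0, 1, 2]))
```

The same shape generalizes across the benchmark's tasks: only the head that turns embeddings into a response (classifier, decoder, ranker) and the metric change, while the audio-to-embedding stage under evaluation stays fixed.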