Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interaction by seamlessly integrating diverse forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and enabling more intuitive interaction. However, the absence of a comprehensive evaluation benchmark poses a significant challenge. We present Dynamic-SUPERB Phase-2, an open and evolving benchmark for the comprehensive evaluation of instruction-based universal speech models. Building on the first generation, this second version incorporates 125 new tasks contributed collaboratively by the global research community, expanding the benchmark to 180 tasks in total and making it the largest benchmark for speech and audio evaluation. Whereas the first generation of Dynamic-SUPERB was limited to classification tasks, Dynamic-SUPERB Phase-2 broadens its scope with a wide array of novel and diverse tasks, including regression and sequence generation, across speech, music, and environmental audio. Evaluation results indicate that no model performed well universally: SALMONN-13B excelled at English ASR, and WavLLM demonstrated high accuracy in emotion recognition, but current models still require further innovation to handle a broader range of tasks. We will soon open-source all task data and the evaluation pipeline.