Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interaction by seamlessly integrating diverse forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and enabling more intuitive interaction. However, the absence of a comprehensive evaluation benchmark poses a significant challenge. We present Dynamic-SUPERB Phase-2, an open and evolving benchmark for the comprehensive evaluation of instruction-based universal speech models. Building on the first generation, this second version incorporates 125 new tasks contributed collaboratively by the global research community, expanding the benchmark to 180 tasks in total and making it the largest benchmark for speech and audio evaluation. Whereas the first generation of Dynamic-SUPERB was limited to classification tasks, Dynamic-SUPERB Phase-2 broadens its scope with a wide array of novel and diverse tasks, including regression and sequence generation, across speech, music, and environmental audio. Evaluation results indicate that no model performed well universally: SALMONN-13B excelled at English ASR, and WavLLM demonstrated high accuracy in emotion recognition, but current models still require further innovation to handle a broader range of tasks. We will soon open-source all task data and the evaluation pipeline.