Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions. However, existing studies in speech processing primarily focus on limited or specific tasks. Moreover, the lack of standardized benchmarks hinders fair comparison across different approaches. We therefore present Dynamic-SUPERB, a benchmark designed for building universal speech models capable of leveraging instruction tuning to perform multiple tasks in a zero-shot fashion. To achieve comprehensive coverage of diverse speech tasks and harness instruction tuning, we invite the community to collaborate and contribute, facilitating the dynamic growth of the benchmark. As a starting point, Dynamic-SUPERB features 55 evaluation instances formed by combining 33 tasks and 22 datasets. These instances span a broad spectrum of dimensions, providing a comprehensive platform for evaluation. Additionally, we propose several approaches to establish benchmark baselines, drawing on speech models, text language models, and a multimodal encoder. Evaluation results indicate that while these baselines perform reasonably on seen tasks, they struggle with unseen ones. We release all materials to the public and welcome researchers to collaborate on the project, advancing technologies in the field together.