Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions. However, existing studies in speech processing focus primarily on a limited set of specific tasks. Moreover, the lack of standardized benchmarks hinders fair comparison across approaches. We therefore present Dynamic-SUPERB, a benchmark designed for building universal speech models that leverage instruction tuning to perform multiple tasks in a zero-shot fashion. To cover diverse speech tasks comprehensively and make full use of instruction tuning, we invite the community to collaborate and contribute, so that the benchmark can grow dynamically. In its initial release, Dynamic-SUPERB features 55 evaluation instances formed by combining 33 tasks and 22 datasets. These instances span a broad spectrum of dimensions, providing a comprehensive platform for evaluation. We also propose several approaches to establish benchmark baselines, drawing on speech models, text language models, and a multimodal encoder. Evaluation results indicate that while these baselines perform reasonably on seen tasks, they struggle with unseen ones. We further conduct an ablation study to assess robustness and explore ways to improve performance. We release all materials to the public and welcome researchers to collaborate on the project and advance the field together.