The foundation model paradigm leverages a shared foundation model to achieve state-of-the-art (SOTA) performance across various tasks while requiring minimal downstream-specific modeling and data annotation. This approach has proven crucial in Natural Language Processing (NLP). However, the speech processing community lacks a similar setup for exploring the paradigm systematically. In this work, we establish the Speech processing Universal PERformance Benchmark (SUPERB) to study the effectiveness of the paradigm for speech. We propose a unified multi-tasking framework that addresses the speech processing tasks in SUPERB with a frozen foundation model followed by task-specialized, lightweight prediction heads. Combining our results with community submissions, we verify that the foundation model paradigm is promising for speech and that our multi-tasking framework is simple yet effective: the best-performing foundation model shows competitive generalizability across most SUPERB tasks. For reproducibility and extensibility, we have developed a long-term maintained platform that enables deterministic benchmarking, allows result sharing via an online leaderboard, and promotes collaboration through a community-driven benchmark database to support new development cycles. Finally, we conduct a series of analyses to offer an in-depth understanding of SUPERB and speech foundation models, including the information flow across tasks inside the models, the correctness of the weighted-sum benchmarking protocol, and the statistical significance and robustness of the benchmark.
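To make the frozen-upstream, weighted-sum protocol concrete, below is a minimal PyTorch sketch. The module name, layer count, dimensions, and the mean-pooling utterance-level head are illustrative assumptions, not the paper's exact implementation; the sketch only shows the core idea that the foundation model stays frozen while softmax-normalized per-layer scalars and a lightweight head are trained per task.

```python
import torch
import torch.nn as nn

class WeightedSumHead(nn.Module):
    """Hypothetical sketch of the weighted-sum protocol: a frozen foundation
    model exposes hidden states from all L layers; a softmax over L learnable
    scalars mixes them, and only the scalars plus a lightweight prediction
    head receive gradients."""

    def __init__(self, num_layers: int, hidden_dim: int, num_classes: int):
        super().__init__()
        # One learnable scalar per upstream layer, softmax-normalized below.
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        # Task-specialized lightweight head (here: a single linear probe).
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, hidden_states: list) -> torch.Tensor:
        # hidden_states: list of L tensors, each (batch, time, hidden_dim),
        # produced by the frozen upstream model (no gradients flow into it).
        stacked = torch.stack(hidden_states, dim=0)           # (L, B, T, D)
        weights = torch.softmax(self.layer_weights, dim=0)    # (L,)
        mixed = (weights.view(-1, 1, 1, 1) * stacked).sum(0)  # (B, T, D)
        # Mean-pool over time for an utterance-level task; sequence tasks
        # would keep the time axis and use a different head instead.
        return self.head(mixed.mean(dim=1))                   # (B, classes)

# Usage with dummy layer-wise features standing in for a frozen upstream;
# the sizes (13 layers, 768-dim, 12 classes) are assumed for illustration.
probe = WeightedSumHead(num_layers=13, hidden_dim=768, num_classes=12)
fake_states = [torch.randn(4, 100, 768) for _ in range(13)]
logits = probe(fake_states)  # shape: (4, 12)
```

Because only the L scalars and the small head are optimized, the same frozen foundation model can serve every SUPERB task, and the learned layer weights also indicate which layers each task draws on.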