Foundation models (FMs) increasingly drive recent advances on a variety of tasks in computer audition, that is, the use of machines to understand sounds. They offer several advantages over traditional pipelines, among them the ability to consolidate multiple tasks in a single model, the option to leverage knowledge from other modalities, and readily available interaction with human users. Naturally, these promises have created substantial excitement in the audio community and have led to a wave of early attempts to build new, general-purpose foundation models for audio. In the present contribution, we give an overview of computational audio analysis as it transitions from traditional pipelines towards auditory foundation models. Our work highlights the key operating principles that underpin those models and showcases how they can accommodate multiple tasks that the audio community previously tackled separately.