This survey provides a comprehensive overview of recent advances and challenges in applying large language models to audio signal processing. Audio processing, with its diverse signal representations and wide range of sources--from human voices to musical instruments and environmental sounds--poses challenges distinct from those found in traditional Natural Language Processing scenarios. Nevertheless, \textit{Large Audio Models}, typified by transformer-based architectures, have shown marked efficacy in this area. By leveraging massive amounts of data, these models have demonstrated strong performance across a variety of audio tasks, ranging from Automatic Speech Recognition and Text-To-Speech to Music Generation. Notably, Foundational Audio Models such as SeamlessM4T have recently begun to act as universal translators, supporting multiple speech tasks for up to 100 languages without relying on separate task-specific systems. This paper presents an in-depth analysis of state-of-the-art methodologies for \textit{Foundational Large Audio Models}, their performance benchmarks, and their applicability to real-world scenarios. We also highlight current limitations and outline potential future research directions for \textit{Large Audio Models}, with the intent of sparking further discussion and fostering innovation in the next generation of audio-processing systems. Furthermore, to keep pace with the rapid development of this area, we will continually update the accompanying repository with recent articles and their open-source implementations at https://github.com/EmulationAI/awesome-large-audio-models.