This survey provides a comprehensive overview of recent advances and challenges in applying large language models to audio signal processing. Audio processing, with its diverse signal representations and wide range of sources, from human voices to musical instruments and environmental sounds, poses challenges distinct from those found in traditional Natural Language Processing scenarios. Nevertheless, \textit{Large Audio Models}, epitomized by transformer-based architectures, have shown marked efficacy in this area. By leveraging massive amounts of data, these models have performed strongly on a variety of audio tasks, ranging from Automatic Speech Recognition and Text-To-Speech to Music Generation, among others. Notably, Foundational Audio Models such as SeamlessM4T have recently begun to act as universal translators, supporting multiple speech tasks for up to 100 languages without relying on separate task-specific systems. This paper presents an in-depth analysis of state-of-the-art methodologies for \textit{Foundational Large Audio Models}, their performance benchmarks, and their applicability to real-world scenarios. We also highlight current limitations and offer insights into potential future research directions for \textit{Large Audio Models}, with the intent of sparking further discussion and fostering innovation in the next generation of audio-processing systems. Furthermore, to keep pace with the rapid development of this area, we will regularly update the accompanying repository with recent articles and their open-source implementations at https://github.com/EmulationAI/awesome-large-audio-models.