Sparks of Large Audio Models: A Survey and Outlook

Siddique Latif, Moazzam Shoukat, Fahad Shamshad, Muhammad Usama, Yi Ren, Heriberto Cuayáhuitl, Wenwu Wang, Xulong Zhang, Roberto Togneri, Erik Cambria, Björn W. Schuller

From arXiv, under review. Repo URL: https://github.com/EmulationAI/awesome-large-audio-models

This survey paper provides a comprehensive overview of recent advancements and challenges in applying large language models to audio signal processing. Audio processing, with its diverse signal representations and wide range of sources (from human voices to musical instruments and environmental sounds), poses challenges distinct from those found in traditional Natural Language Processing scenarios. Nevertheless, Large Audio Models, epitomized by transformer-based architectures, have shown marked efficacy in this sphere. By leveraging massive amounts of data, these models have demonstrated prowess in a variety of audio tasks, ranging from Automatic Speech Recognition and Text-To-Speech to Music Generation, among others. Notably, Foundational Audio Models such as SeamlessM4T have recently started to show abilities to act as universal translators, supporting multiple speech tasks for up to 100 languages without any reliance on separate task-specific systems. This paper presents an in-depth analysis of state-of-the-art methodologies for Foundational Large Audio Models, their performance benchmarks, and their applicability to real-world scenarios. We also highlight current limitations and provide insights into potential future research directions for Large Audio Models, with the intent to spark further discussion and thereby foster innovation in the next generation of audio-processing systems. Furthermore, to keep pace with the rapid development in this area, we will regularly update the accompanying repository with recent articles and their open-source implementations at https://github.com/EmulationAI/awesome-large-audio-models.

