LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT

Jiaming Wang, Zhihao Du, Qian Chen, Yunfei Chu, Zhifu Gao, Zerui Li, Kai Hu, Xiaohuan Zhou, Jin Xu, Ziyang Ma, Wen Wang, Siqi Zheng, Chang Zhou, Zhijie Yan, Shiliang Zhang

From arXiv, 10 pages, under review

Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks. However, there has been limited research on applying similar frameworks to audio tasks. Previously proposed large language models for audio tasks either lack sufficient quantitative evaluation, are limited to recognizing and understanding audio content, or significantly underperform existing state-of-the-art (SOTA) models. In this paper, we propose LauraGPT, a unified GPT model for audio recognition, understanding, and generation. LauraGPT is a versatile language model that can process both audio and text inputs and generate outputs in either modality. It can perform a wide range of tasks related to content, semantics, paralinguistics, and audio-signal analysis. Its noteworthy tasks include automatic speech recognition, speech-to-text translation, text-to-speech synthesis, machine translation, speech enhancement, automated audio captioning, speech emotion recognition, and spoken language understanding. To achieve this goal, we use a combination of continuous and discrete features for audio: we encode input audio into continuous representations using an audio encoder and decode output audio from discrete codec codes. We then fine-tune a large decoder-only Transformer-based language model on multiple audio-to-text, text-to-audio, audio-to-audio, and text-to-text tasks using a supervised multitask learning approach. Extensive experiments show that LauraGPT achieves competitive or superior performance compared to existing SOTA models on various audio processing benchmarks.
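The split between continuous input features and discrete output codes can be illustrated with a toy sketch. This is not the authors' implementation. The abstract does not give vocabulary sizes, embedding widths, or how a task is signalled to the model, so every name and constant below (`toy_audio_encoder`, `TASK_EMB`, the placement of a task vector after the input, the shared output vocabulary layout) is a hypothetical choice made only for illustration.

```python
import random

D_MODEL = 4        # toy embedding width (assumption)
TEXT_VOCAB = 100   # toy text vocabulary size (assumption)
CODEC_VOCAB = 16   # toy codec codebook size (assumption)
TASKS = ["asr", "s2tt", "tts", "mt", "se", "aac", "ser", "slu"]

_rng = random.Random(0)
# Hypothetical embedding tables for text tokens and task tags.
TEXT_EMB = [[_rng.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(TEXT_VOCAB)]
TASK_EMB = {t: [_rng.uniform(-1, 1) for _ in range(D_MODEL)] for t in TASKS}


def toy_audio_encoder(samples, frame_len=4):
    """Stand-in for an audio encoder: one continuous vector per frame."""
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    vectors = []
    for f in frames:
        mean = sum(f) / len(f)
        peak = max(abs(x) for x in f)
        # Four summary statistics, so each vector has D_MODEL entries.
        vectors.append([mean, peak, f[0], f[-1]])
    return vectors


def to_output_id(token, modality):
    """Place text tokens and codec codes in one shared output vocabulary."""
    if modality == "text":
        if not 0 <= token < TEXT_VOCAB:
            raise ValueError("text token out of range")
        return token
    if modality == "audio":
        if not 0 <= token < CODEC_VOCAB:
            raise ValueError("codec code out of range")
        return TEXT_VOCAB + token  # codec codes follow the text ids
    raise ValueError(f"unknown modality: {modality}")


def build_example(task, source, source_modality, target, target_modality):
    """Return (continuous prefix vectors, discrete target ids) for one task."""
    if source_modality == "audio":
        prefix = toy_audio_encoder(source)        # continuous features
    else:
        prefix = [TEXT_EMB[t] for t in source]    # embedded text tokens
    prefix = prefix + [TASK_EMB[task]]            # task tag after the input (assumption)
    labels = [to_output_id(t, target_modality) for t in target]
    return prefix, labels


# Audio-to-text (ASR-like) and text-to-audio (TTS-like) share one format.
asr_prefix, asr_labels = build_example("asr", [0.1] * 8, "audio", [5, 7], "text")
tts_prefix, tts_labels = build_example("tts", [5, 7], "text", [3, 15], "audio")
```

In this sketch every task reduces to the same shape, a sequence of continuous vectors followed by discrete targets, which is what would let a single decoder-only model be fine-tuned on audio-to-text, text-to-audio, audio-to-audio, and text-to-text data together. Turning the predicted codec codes back into a waveform would need a codec decoder, which is omitted here.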
