LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT

Zhihao Du, Jiaming Wang, Qian Chen, Yunfei Chu, Zhifu Gao, Zerui Li, Kai Hu, Xiaohuan Zhou, Jin Xu, Ziyang Ma, Wen Wang, Siqi Zheng, Chang Zhou, Zhijie Yan, Shiliang Zhang

From arXiv, 10 pages, work in progress

Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks and have shown great potential as backbones for audio-and-text large language models (LLMs). Previous mainstream audio-and-text LLMs use discrete audio tokens to represent both input and output audio; however, on tasks such as automatic speech recognition, speech-to-text translation, and speech enhancement, they perform worse than models that use continuous speech features. In this paper, we propose LauraGPT, a novel unified audio-and-text GPT-based LLM for audio recognition, understanding, and generation. LauraGPT is a versatile LLM that can process both audio and text inputs and generate outputs in either modality. We propose a data representation that combines continuous and discrete features for audio: LauraGPT encodes input audio into continuous representations using an audio encoder and generates output audio from discrete codec codes. We propose a one-step codec vocoder to overcome the prediction challenge caused by the multimodal distribution of codec tokens. We fine-tune LauraGPT using supervised multi-task learning. Extensive experiments show that LauraGPT achieves performance comparable or superior to strong baselines on a wide range of audio tasks related to content, semantics, paralinguistics, and audio-signal analysis, such as automatic speech recognition, speech-to-text translation, text-to-speech synthesis, speech enhancement, automated audio captioning, speech emotion recognition, and spoken language understanding.
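
The mixed representation described in the abstract can be illustrated with a toy sketch: input audio becomes continuous per-frame vectors, while output audio is expressed as discrete codec token indices. This is not LauraGPT's implementation. All names, sizes, and the random linear maps below are hypothetical stand-ins, and the language model between the encoder and the output head is omitted for brevity.

```python
import random

# All sizes below are hypothetical, chosen only to keep the toy example small.
EMBED_DIM = 8      # width of the continuous audio representation
CODEC_VOCAB = 16   # number of entries in the discrete codec codebook
FRAME_LEN = 4      # waveform samples per frame

rng = random.Random(0)


def make_matrix(rows, cols):
    """Random matrix standing in for learned weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


ENCODER_W = make_matrix(FRAME_LEN, EMBED_DIM)    # stand-in for the audio encoder
OUTPUT_W = make_matrix(EMBED_DIM, CODEC_VOCAB)   # stand-in for the output head


def encode_audio(samples):
    """Input side: map each frame to a continuous vector (no quantization)."""
    frames = [
        samples[i:i + FRAME_LEN]
        for i in range(0, len(samples) - FRAME_LEN + 1, FRAME_LEN)
    ]
    return [
        [sum(x * ENCODER_W[j][d] for j, x in enumerate(frame))
         for d in range(EMBED_DIM)]
        for frame in frames
    ]


def predict_codec_tokens(hidden_states):
    """Output side: choose one discrete codec token id per hidden state."""
    tokens = []
    for h in hidden_states:
        logits = [
            sum(h[d] * OUTPUT_W[d][k] for d in range(EMBED_DIM))
            for k in range(CODEC_VOCAB)
        ]
        tokens.append(max(range(CODEC_VOCAB), key=logits.__getitem__))
    return tokens


waveform = [rng.uniform(-1.0, 1.0) for _ in range(16)]
audio_embeddings = encode_audio(waveform)               # continuous input features
codec_tokens = predict_codec_tokens(audio_embeddings)   # discrete output tokens
```

In this sketch the input side never passes through a codebook, so no quantization error is introduced before recognition or understanding, while the output side is a sequence of integer ids that a codec vocoder could turn back into a waveform.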
