We propose WHISPER-GPT: a generative large language model (LLM) for speech and music that works with continuous audio representations and discrete tokens simultaneously within a single architecture. There has been a huge surge in generative audio, speech, and music models that use discrete audio tokens derived from neural compression algorithms, e.g., ENCODEC. However, one of the major drawbacks of this approach is handling the context length: it blows up for high-fidelity generative architectures if one has to account for all the audio content at various frequencies for next-token prediction. By combining a continuous audio representation, such as the spectrogram, with discrete acoustic tokens, we retain the best of both worlds: we have all the information needed from the audio at a specific time instance in a single token, yet still let the LLM predict the future token, which allows for sampling and the other benefits a discrete space provides. We show how our architecture improves the perplexity and negative log-likelihood scores for next-token prediction compared to a token-based LLM for speech and music.
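The two evaluation metrics named above, negative log-likelihood (NLL) and perplexity, are directly related: perplexity is the exponential of the mean per-token NLL. The following is a minimal sketch of that relationship, not the evaluation code used for WHISPER-GPT. The function names and the probability values are hypothetical and chosen only for illustration.

```python
import math


def mean_nll(target_probs):
    """Mean negative log-likelihood (in nats) over a sequence.

    target_probs: probabilities a model assigned to the token that
    actually came next at each position (hypothetical inputs).
    """
    return -sum(math.log(p) for p in target_probs) / len(target_probs)


def perplexity(target_probs):
    """Perplexity is the exponential of the mean per-token NLL."""
    return math.exp(mean_nll(target_probs))


# Made-up probabilities for two hypothetical models on the same four
# next-token targets. These are NOT results from the paper.
model_a = [0.10, 0.05, 0.20, 0.10]
model_b = [0.20, 0.10, 0.40, 0.20]

print(f"model_a: NLL={mean_nll(model_a):.3f}, PPL={perplexity(model_a):.2f}")
print(f"model_b: NLL={mean_nll(model_b):.3f}, PPL={perplexity(model_b):.2f}")
```

Lower values are better for both metrics. A model that assigns uniform probability over k candidate tokens has a perplexity of exactly k, which gives a useful reference point when comparing a token-only model with one that also sees a continuous representation.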