Audio Language Models (ALMs) have emerged as the dominant paradigm for speech and music generation by representing audio as sequences of discrete tokens. Yet, unlike text tokens, which are invertible, audio tokens are extracted from lossy codecs with a limited bitrate. As a consequence, increasing audio quality requires generating more tokens, which imposes a trade-off between fidelity and computational cost. We address this issue by studying Continuous Audio Language Models (CALMs). These models instantiate a large Transformer backbone that produces a contextual embedding at every timestep. This sequential information then conditions an MLP that generates the next continuous frame of an audio VAE through consistency modeling. By avoiding lossy compression, CALMs achieve higher quality at a lower computational cost than their discrete counterparts. Experiments on speech and music demonstrate improved efficiency and fidelity over state-of-the-art discrete audio language models, facilitating lightweight, high-quality audio generation. Samples are available at hf.co/spaces/kyutai/calm-samples. Finally, we release Pocket TTS, an open-source 100M-parameter text-to-speech model that can run faster than real time on a laptop CPU: github.com/kyutai-labs/pocket-tts.
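The generation loop described above (backbone embedding at each timestep, then a conditioning MLP that samples the next continuous VAE frame via a few consistency steps) can be sketched in pure Python. This is a minimal toy sketch, not the paper's implementation: the shapes, the number of sampling steps, and the stand-in `backbone`, `mlp_head`, and `sample_frame` functions are all illustrative assumptions.

```python
import random

DIM = 4    # latent frame dimension (toy assumption)
STEPS = 2  # consistency sampling steps per frame (toy assumption)

def backbone(context):
    """Stand-in for the Transformer backbone: summarize past frames
    into a contextual embedding (here, a trivial mean-pool)."""
    if not context:
        return [0.0] * DIM
    return [sum(f[i] for f in context) / len(context) for i in range(DIM)]

def mlp_head(embedding, noisy, t):
    """Stand-in for the consistency MLP: move a noisy frame toward a
    clean one, conditioned on the embedding and noise level t."""
    return [(1 - t) * n + t * e for n, e in zip(noisy, embedding)]

def sample_frame(embedding, rng):
    """Few-step consistency-style sampling of one continuous frame,
    starting from Gaussian noise."""
    x = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
    for s in range(STEPS):
        t = (s + 0.5) / STEPS  # increasing denoising level
        x = mlp_head(embedding, x, t)
    return x

def generate(num_frames, seed=0):
    """Autoregressive loop: embed the context, then sample the next
    continuous VAE frame, one timestep at a time."""
    rng = random.Random(seed)
    frames = []
    for _ in range(num_frames):
        emb = backbone(frames)
        frames.append(sample_frame(emb, rng))
    return frames

frames = generate(5)
```

The key structural point the sketch preserves is that the autoregressive model emits continuous vectors directly, so no discrete tokenizer (and hence no lossy codec bitrate) sits between the language model and the audio VAE.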