Audio language models have recently emerged as a promising approach for various audio generation tasks, relying on audio tokenizers to encode waveforms into sequences of discrete symbols. Audio tokenization involves a trade-off between code bitrate and reconstruction accuracy. With low-bitrate audio codes, language models can process only a subset of the information embedded in the audio, which in turn restricts their generative capabilities. To circumvent these issues, we propose encoding audio as vector sequences in continuous space $\mathbb R^d$ and autoregressively generating these sequences using a decoder-only diffusion transformer (ARDiT). Our results show that ARDiT excels at zero-shot text-to-speech, with performance comparable to or even surpassing that of state-of-the-art models. The high-bitrate continuous speech representation enables almost flawless reconstruction, allowing our model to achieve nearly perfect speech editing. Our experiments reveal that distilling with the Integral Kullback-Leibler (IKL) divergence at each autoregressive step significantly boosts the perceived quality of the samples, while also condensing the iterative sampling process of the diffusion model into a single step. Furthermore, ARDiT can be trained to predict several continuous vectors in one step, significantly reducing latency during sampling. Notably, one of our models can generate $170$ ms of $24$ kHz speech per evaluation step with minimal degradation in performance. Audio samples are available at http://ardit-tts.github.io/ .
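To give a sense of scale for the reported figure of $170$ ms of $24$ kHz speech per evaluation step, the following minimal sketch converts it into waveform samples per step and the number of evaluation steps needed for an utterance of a given length. The function names are hypothetical and not taken from the paper; only the two numbers ($170$ ms, $24$ kHz) come from the abstract, and the 10-second utterance is an arbitrary illustration.

```python
import math


def samples_per_step(step_ms: int, sample_rate_hz: int) -> int:
    """Waveform samples covered by one evaluation step (hypothetical helper)."""
    # Integer arithmetic avoids floating-point rounding in the conversion.
    return step_ms * sample_rate_hz // 1000


def steps_for_duration(duration_ms: int, step_ms: int) -> int:
    """Evaluation steps needed to cover duration_ms of audio, rounded up."""
    return math.ceil(duration_ms / step_ms)


# 170 ms per step at 24 kHz, as stated in the abstract.
print(samples_per_step(170, 24_000))    # samples generated per step
# An arbitrary 10-second utterance, chosen only for illustration.
print(steps_for_duration(10_000, 170))  # steps needed to cover it
```

Under these assumptions, each evaluation step covers 4080 waveform samples, and a 10-second utterance requires 59 steps.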