Building upon advancements in Large Language Models (LLMs), the field of audio processing has seen growing interest in training audio generation models on discrete audio token sequences. However, directly discretizing audio with neural audio codecs often yields sequences that differ fundamentally from text. Unlike text, where token sequences are deterministic, discrete audio tokens can vary significantly with contextual factors while still decoding to perceptually identical audio segments. We refer to this phenomenon as \textbf{Discrete Representation Inconsistency (DRI)}. This inconsistency allows a single audio segment to be represented by multiple divergent sequences, which confuses neural codec language models and causes omissions and repetitions during speech generation. In this paper, we quantitatively analyze the DRI phenomenon in popular audio tokenizers such as EnCodec. Our approach effectively mitigates DRI in the neural audio codec. Furthermore, extensive experiments with neural codec language models on the LibriTTS and large-scale MLS datasets (44,000 hours) demonstrate the effectiveness and generality of our method. Audio samples are available online~\footnote{\url{https://consistencyinneuralcodec.github.io}}.