With the recent rapid growth of large language models (LLMs), discrete speech tokenization has played an important role in injecting speech into LLMs. However, this discretization incurs a loss of information, which in turn impairs overall performance. To improve the quality of these discrete speech tokens, we present RepCodec, a novel speech representation codec for semantic speech tokenization. In contrast to audio codecs, which reconstruct the raw audio, RepCodec learns a vector quantization codebook by reconstructing speech representations from speech encoders such as HuBERT or data2vec. Together, the speech encoder, the codec encoder, and the vector quantization codebook form a pipeline that converts speech waveforms into semantic tokens. Extensive experiments show that RepCodec, by virtue of its greater information retention, significantly outperforms the widely used k-means clustering approach in both speech understanding and generation. Moreover, this advantage holds across various speech encoders and languages, confirming the robustness of RepCodec. We believe our method can facilitate large language modeling research on speech processing.
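To make the tokenization pipeline concrete, the following is a minimal sketch of the vector-quantization step: each frame-level representation from a speech encoder is mapped to the index of its nearest codeword, and the codeword itself serves as the reconstructed representation. All names, dimensions, and the random data below are illustrative assumptions, not RepCodec's actual architecture (which trains the codebook jointly with a codec encoder and decoder).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in practice these would be frame-level outputs of a
# speech encoder such as HuBERT; here we use random data for illustration.
T, D, K = 50, 8, 4                  # frames, feature dim, codebook size
reps = rng.normal(size=(T, D))      # [T, D] speech representations

# A vector quantization codebook; RepCodec learns this by reconstructing the
# representations, whereas here it is simply random for the sketch.
codebook = rng.normal(size=(K, D))  # [K, D]

def quantize(x, codebook):
    """Map each frame to the index of its nearest codeword (L2 distance)."""
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # [T, K]
    return dists.argmin(axis=1)

tokens = quantize(reps, codebook)   # discrete semantic tokens, shape [T]
recon = codebook[tokens]            # reconstructed representations, shape [T, D]
```

The discrete `tokens` are what would be fed to a language model, while the reconstruction error between `recon` and `reps` is the kind of objective a representation codec minimizes; k-means clustering produces tokens the same way, but without a learned encoder/decoder around the codebook.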