With the recent rapid growth of large language models (LLMs), discrete speech tokenization has played an important role in injecting speech into LLMs. However, this discretization loses information, which impairs overall performance. To improve the performance of these discrete speech tokens, we present RepCodec, a novel speech representation codec for semantic speech tokenization. In contrast to audio codecs, which reconstruct the raw audio, RepCodec learns a vector quantization codebook by reconstructing speech representations from speech encoders such as HuBERT or data2vec. Together, the speech encoder, the codec encoder, and the vector quantization codebook form a pipeline for converting speech waveforms into semantic tokens. Extensive experiments show that RepCodec, by virtue of its enhanced information retention capacity, significantly outperforms the widely used k-means clustering approach in both speech understanding and generation. Furthermore, this superiority extends across various speech encoders and languages, affirming the robustness of RepCodec. We believe our method can facilitate large language modeling research on speech processing.
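As a rough illustration of the last stage of the pipeline described in the abstract, the sketch below maps latent frames to token indices by nearest-neighbor lookup in a codebook. It is a toy under stated assumptions, not RepCodec's implementation: the function names `squared_distance` and `quantize`, the 2-D vectors, the three-entry codebook, and the use of squared Euclidean distance are all hypothetical choices made for illustration. The speech encoder, the codec encoder, and the codebook training are not shown.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Return, for each frame, the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]


# Made-up values: a 3-entry codebook and 4 latent frames standing in for
# the outputs of a codec encoder (assumption, not data from the paper).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
frames = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8), (1.1, 0.1)]

tokens = quantize(frames, codebook)
print(tokens)  # [0, 1, 2, 1]
```

Each frame becomes one integer token, so an utterance with T frames becomes a sequence of T indices. This discrete sequence is the kind of input a language model can consume.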