With the recent rapid growth of large language models (LLMs), discrete speech tokenization has come to play an important role in injecting speech into LLMs. However, discretization loses information, which impairs overall performance. To improve the performance of discrete speech tokens, we present RepCodec, a novel speech representation codec for semantic speech tokenization. In contrast to audio codecs, which reconstruct the raw audio, RepCodec learns a vector quantization codebook by reconstructing speech representations from speech encoders such as HuBERT or data2vec. Together, the speech encoder, the codec encoder, and the vector quantization codebook form a pipeline that converts speech waveforms into semantic tokens. Extensive experiments show that RepCodec, by retaining more information, significantly outperforms the widely used k-means clustering approach in both speech understanding and generation. This advantage holds across various speech encoders and languages, affirming the robustness of RepCodec. We believe our method can facilitate large language modeling research on speech processing.
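To make the tokenization pipeline concrete, the following is a minimal sketch of the inference-time path only: frame-level representations (which would come from a speech encoder such as HuBERT) pass through a codec encoder, and each resulting frame is mapped to the index of its nearest codebook entry. The function names, the identity encoder, and the toy codebook are all illustrative assumptions, not RepCodec's actual implementation; training the codebook by reconstructing the representations is not shown.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frame: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to `frame`."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(frame, codebook[k]))


def tokenize(
    representations: Sequence[Vector],
    codec_encoder: Callable[[Vector], Vector],
    codebook: Sequence[Vector],
) -> List[int]:
    """Map frame-level speech representations to discrete token ids."""
    return [quantize(codec_encoder(frame), codebook)
            for frame in representations]


def identity_encoder(frame: Vector) -> Vector:
    """Hypothetical stand-in for a learned codec encoder."""
    return frame


# Toy data (assumptions): a 3-entry codebook and four 2-dimensional frames.
# Real representations would have hundreds of dimensions per frame.
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
toy_representations = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7], [0.6, 0.6]]

tokens = tokenize(toy_representations, identity_encoder, toy_codebook)
print(tokens)  # [0, 1, 2, 1]
```

With k-means tokenization, the codebook would be the set of cluster centroids and the encoder the identity, so the lookup step is the same; the difference the abstract describes lies in how the codebook and codec encoder are learned.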