Language models (LMs) have recently flourished in natural language processing and computer vision, generating high-fidelity text and images across a variety of tasks. In contrast, current speech generative models still struggle with speech quality and task generalization. This paper presents Vec-Tok Speech, an extensible framework that unifies multiple speech generation tasks and generates expressive, high-fidelity speech. Specifically, we propose a novel speech codec based on speech vectors and semantic tokens. Speech vectors carry the acoustic details needed for high-fidelity speech reconstruction, while semantic tokens capture the linguistic content of speech, which facilitates language modeling. Building on the proposed codec, Vec-Tok Speech uses an LM as the core of speech generation. Moreover, Byte-Pair Encoding (BPE) is introduced to shorten the token sequence and lower the bit rate, which reduces exposure bias, extends context coverage, and improves LM performance. Vec-Tok Speech can be used for intra- and cross-lingual zero-shot voice conversion (VC), zero-shot speaking-style-transfer text-to-speech (TTS), speech-to-speech translation (S2ST), speech denoising, and speaker de-identification and anonymization. Experiments show that Vec-Tok Speech, trained on 50k hours of speech, outperforms other state-of-the-art (SOTA) models. Code will be available at https://github.com/BakerBunker/VecTok.
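To illustrate how BPE can shorten a sequence of discrete semantic tokens, the following is a minimal sketch of generic byte-pair merging over integer token IDs. It is not the authors' implementation: the function names (`learn_bpe`, `merge_pair`, `encode`, `decode`), the toy token IDs, and the choice of `first_new_id` are all assumptions made for illustration, and the abstract does not specify the vocabulary size or the number of merges used in Vec-Tok Speech.

```python
from collections import Counter


def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(sequences, num_merges, first_new_id):
    """Learn up to `num_merges` merges from integer token sequences.

    `first_new_id` (hypothetical parameter) must be larger than any
    existing token ID so that merged tokens do not collide with them.
    """
    seqs = [list(s) for s in sequences]
    merges = []  # ordered list of ((left, right), new_id)
    next_id = first_new_id
    for _ in range(num_merges):
        # Count adjacent token pairs over the whole corpus.
        pairs = Counter()
        for s in seqs:
            pairs.update(zip(s, s[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so merging would not shorten anything
        merges.append((pair, next_id))
        seqs = [merge_pair(s, pair, next_id) for s in seqs]
        next_id += 1
    return merges


def encode(seq, merges):
    """Apply the learned merges in order to compress a token sequence."""
    seq = list(seq)
    for pair, new_id in merges:
        seq = merge_pair(seq, pair, new_id)
    return seq


def decode(seq, merges):
    """Expand merged tokens back into the original token sequence."""
    expand = {new_id: pair for pair, new_id in merges}
    out = []
    stack = list(reversed(seq))
    while stack:
        tok = stack.pop()
        if tok in expand:
            left, right = expand[tok]
            stack.append(right)
            stack.append(left)
        else:
            out.append(tok)
    return out


if __name__ == "__main__":
    # Toy semantic-token sequences (made-up IDs, for illustration only).
    corpus = [[5, 5, 7, 7, 5, 5, 7, 7], [5, 5, 7, 3]]
    merges = learn_bpe(corpus, num_merges=2, first_new_id=100)
    for s in corpus:
        e = encode(s, merges)
        print(len(s), "->", len(e), "tokens; lossless:", decode(e, merges) == s)
```

Because each merge replaces two tokens with one, the LM sees a shorter sequence for the same stretch of speech, which is the sense in which BPE extends context coverage; decoding recovers the original tokens exactly, so no content is lost.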