Language models (LMs) have recently flourished in natural language processing and computer vision, generating high-fidelity text or images across various tasks. In contrast, current speech generative models still struggle with speech quality and task generalization. This paper presents Vec-Tok Speech, an extensible framework that unifies multiple speech generation tasks and generates expressive, high-fidelity speech. Specifically, we propose a novel speech codec based on speech vectors and semantic tokens. Speech vectors contain the acoustic details that contribute to high-fidelity speech reconstruction, while semantic tokens focus on the linguistic content of speech, facilitating language modeling. Based on the proposed speech codec, Vec-Tok Speech leverages an LM to carry out the core of speech generation. Moreover, Byte-Pair Encoding (BPE) is introduced to reduce the token length and bit rate, which lowers exposure bias and extends context coverage, improving the performance of the LM. Vec-Tok Speech can be used for intra- and cross-lingual zero-shot voice conversion (VC), zero-shot speaking-style-transfer text-to-speech (TTS), speech-to-speech translation (S2ST), speech denoising, and speaker de-identification and anonymization. Experiments show that Vec-Tok Speech, built on 50k hours of speech, outperforms other state-of-the-art (SOTA) models. Code will be available at https://github.com/BakerBunker/VecTok.
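The effect of BPE on token length can be illustrated with a minimal sketch. It assumes that semantic tokens are integer IDs (for example, cluster indices) and that new merged symbols are numbered from a hypothetical `first_new_id`; the abstract does not specify the actual tokenizer, vocabulary size, or number of merges, so every name and value below is illustrative only.

```python
from collections import Counter


def most_frequent_pair(seqs):
    """Return ((a, b), count) for the most frequent adjacent pair, or None."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0] if counts else None


def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(seqs, num_merges, first_new_id):
    """Greedily learn up to `num_merges` merges; return (merges, compressed seqs)."""
    seqs = [list(s) for s in seqs]
    merges = []
    for _ in range(num_merges):
        best = most_frequent_pair(seqs)
        if best is None or best[1] < 2:  # nothing worth merging
            break
        pair = best[0]
        new_id = first_new_id + len(merges)
        merges.append((pair, new_id))
        seqs = [merge_pair(s, pair, new_id) for s in seqs]
    return merges, seqs


def decode(seq, merges):
    """Expand merged symbols back into the original token IDs."""
    table = {new_id: pair for pair, new_id in merges}
    out, stack = [], list(reversed(seq))
    while stack:
        tok = stack.pop()
        if tok in table:
            a, b = table[tok]
            stack.append(b)  # pushed first so that `a` is popped first
            stack.append(a)
        else:
            out.append(tok)
    return out


# Toy semantic-token sequences (hypothetical IDs drawn from a vocabulary of 8).
toy = [[5, 5, 7, 7, 5, 5, 7, 7, 3], [5, 5, 7, 7, 3, 3]]
merges, compressed = learn_bpe(toy, num_merges=3, first_new_id=8)
print(sum(map(len, toy)), "tokens before,", sum(map(len, compressed)), "after")
```

Because each merge replaces two symbols with one, the LM sees shorter sequences for the same stretch of speech, which is the mechanism the abstract credits for longer context coverage; `decode` shows that the original token IDs remain recoverable.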