LM-VC: Zero-shot Voice Conversion via Speech Generation based on Language Models

Language model (LM) based audio generation frameworks, e.g., AudioLM, have recently achieved new state-of-the-art performance in zero-shot audio generation. In this paper, we explore the feasibility of LMs for zero-shot voice conversion. An intuitive approach is to follow AudioLM: tokenize speech into semantic and acoustic tokens with HuBERT and SoundStream, respectively, and convert source semantic tokens to target acoustic tokens conditioned on acoustic tokens of the target speaker. However, such an approach encounters several issues: 1) the linguistic content contained in semantic tokens may get dispersed during multi-layer modeling, while the lengthy speech input in the voice conversion task makes contextual learning even harder; 2) the semantic tokens still contain speaker-related information, which may leak into the target speech, lowering the target speaker similarity; 3) the generation diversity in the sampling of the LM can lead to unexpected outcomes during inference, causing unnatural pronunciation and speech quality degradation. To mitigate these problems, we propose LM-VC, a two-stage language modeling approach that first generates coarse acoustic tokens to recover the source linguistic content and the target speaker's timbre, and then reconstructs the fine acoustic tokens, which carry the acoustic details, to produce the converted speech. Specifically, to enhance content preservation and facilitate better disentanglement, a masked prefix LM with a mask prediction strategy is used for coarse acoustic modeling. This model is encouraged to recover the masked content from the surrounding context and to generate target speech based on the target speaker's utterance and corrupted semantic tokens. In addition, to further alleviate sampling errors in generation, an external LM, which employs window attention to capture local acoustic relations, is introduced to participate in the coarse acoustic modeling.
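
As a rough illustration of two ideas named in the abstract, the sketch below corrupts a semantic token sequence by masking random spans and builds a local window attention mask. It is a minimal sketch under stated assumptions, not the LM-VC implementation: the names `MASK_ID`, `mask_spans`, and `local_window_mask`, the span-based masking scheme, the masking ratio, and the symmetric window are all hypothetical choices that the abstract does not specify.

```python
import random

MASK_ID = -1  # hypothetical id reserved for masked positions (assumption)


def mask_spans(tokens, mask_ratio=0.3, span_len=3, seed=0):
    """Replace random spans of semantic tokens with MASK_ID.

    Returns the corrupted sequence and the sorted list of masked positions.
    The span scheme and ratio are illustrative assumptions.
    """
    rng = random.Random(seed)
    n = len(tokens)
    target = min(n, int(n * mask_ratio))  # number of positions to mask
    masked = set()
    while len(masked) < target:
        start = rng.randrange(n)
        for i in range(start, min(start + span_len, n)):
            if len(masked) < target:
                masked.add(i)
    corrupted = [MASK_ID if i in masked else t for i, t in enumerate(tokens)]
    return corrupted, sorted(masked)


def local_window_mask(length, radius):
    """Return mask[i][j], True when position i may attend to position j.

    A symmetric window of the given radius is assumed here; a causal
    window would be an equally plausible reading of "window attention".
    """
    return [[abs(i - j) <= radius for j in range(length)] for i in range(length)]


if __name__ == "__main__":
    semantic = list(range(100, 120))  # stand-in for HuBERT-style token ids
    corrupted, positions = mask_spans(semantic)
    print("corrupted tokens:", corrupted)
    print("masked positions:", positions)
    for row in local_window_mask(5, 1):
        print("".join("x" if allowed else "." for allowed in row))
```

In a setup like the one described, a model would be trained to recover the tokens at the masked positions from the surrounding context, while the window mask would restrict each position of the external LM to its local neighbourhood.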
