Style voice conversion aims to transform the style of source speech into a desired style according to real-world application demands. However, current style voice conversion approaches rely on pre-defined labels or reference speech to control the conversion process, which either limits style diversity or yields a style representation that is neither intuitive nor interpretable. In this study, we propose PromptVC, a novel style voice conversion approach that employs a latent diffusion model to generate a style vector driven by natural language prompts. Specifically, the style vector is extracted by a style encoder during training, and the latent diffusion model is then trained independently to sample the style vector from noise, conditioned on natural language prompts. To improve style expressiveness, we use HuBERT to extract discrete tokens and replace each token with its K-Means center embedding, which serves as the linguistic content and minimizes residual style information. Additionally, we deduplicate repeated discrete tokens and employ a differentiable duration predictor to re-predict the duration of each token, so that the duration of the same linguistic content can adapt to different styles. Subjective and objective evaluation results demonstrate the effectiveness of the proposed system.
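The content pipeline described above (discrete tokens, deduplication, K-Means center lookup) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that "deduplicate" means collapsing consecutive repeats of the same frame-level token, and the names `deduplicate_tokens`, `lookup_centers`, the token ids, and the toy 2-dimensional centers are all hypothetical. The run lengths it returns are the frame-level durations that deduplication discards, which is the quantity a duration predictor would then have to re-predict for a given style.

```python
from itertools import groupby


def deduplicate_tokens(tokens):
    """Collapse consecutive repeats; return (unique_tokens, run_lengths)."""
    unique, durations = [], []
    for token, run in groupby(tokens):
        unique.append(token)
        durations.append(sum(1 for _ in run))  # frames covered by this token
    return unique, durations


def lookup_centers(tokens, centers):
    """Replace each token id with its (hypothetical) cluster-center embedding."""
    return [centers[t] for t in tokens]


# Hypothetical frame-level cluster ids, standing in for quantized HuBERT features.
frame_tokens = [5, 5, 5, 2, 2, 7, 5, 5]
unique, durations = deduplicate_tokens(frame_tokens)

# Toy 2-dimensional centers; real centers would come from a fitted K-Means model.
centers = {2: [0.1, 0.9], 5: [0.4, 0.2], 7: [0.8, 0.5]}
content = lookup_centers(unique, centers)

print(unique)     # deduplicated token sequence
print(durations)  # original run length of each token, in frames
```

Because the deduplicated sequence no longer carries timing, the same `content` could in principle be paired with different predicted durations for different styles, which is the motivation the abstract gives for re-predicting durations.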