Zero-shot voice conversion (VC) aims to convert the timbre of source speech to that of an arbitrary unseen speaker while preserving the original linguistic content. Despite recent advances in zero-shot VC using language-model-based or diffusion-based approaches, several challenges remain: 1) current approaches focus primarily on adapting the timbre of unseen speakers and cannot transfer style and timbre to different unseen speakers independently; 2) these approaches often suffer from slow inference due to autoregressive modeling or the need for numerous sampling steps; 3) the quality and similarity of the converted samples are still not fully satisfactory. To address these challenges, we propose a style-controllable zero-shot VC approach named StableVC, which transfers timbre and style from source speech to different unseen target speakers. Specifically, we decompose speech into linguistic content, timbre, and style, and then employ a conditional flow matching module to reconstruct a high-quality mel-spectrogram from these decomposed features. To effectively capture timbre and style in a zero-shot manner, we introduce a novel dual attention mechanism with an adaptive gate, rather than conventional feature concatenation. With this non-autoregressive design, StableVC efficiently captures the intricate timbre and style of different unseen speakers and generates high-quality speech significantly faster than real time. Experiments demonstrate that StableVC outperforms state-of-the-art baseline systems in zero-shot VC and achieves flexible control over timbre and style from different unseen speakers. Moreover, StableVC samples approximately 25x and 1.65x faster than the autoregressive and diffusion-based baselines, respectively.
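The dual attention mechanism with an adaptive gate can be pictured as two cross-attention branches, one attending to reference timbre features and one to reference style features, blended by a learned sigmoid gate instead of being concatenated. The sketch below is a minimal toy in numpy; all names, shapes, and the single-head formulation are our assumptions for illustration, not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # single-head scaled dot-product cross-attention
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    return softmax(scores) @ v

def dual_attention_with_gate(content, timbre, style, Wg, bg):
    # content: (T, d) frame-level linguistic features
    # timbre, style: (N, d) reference features from (possibly different) speakers
    t_out = attention(content, timbre, timbre)  # timbre branch
    s_out = attention(content, style, style)    # style branch
    # adaptive gate: per-frame, per-channel blend of the two branches
    gate = 1.0 / (1.0 + np.exp(-(np.concatenate([t_out, s_out], axis=-1) @ Wg + bg)))
    return gate * t_out + (1.0 - gate) * s_out
```

Because the two branches attend to separate reference utterances, the gate lets the model weight timbre and style cues independently per frame, which is one plausible way to realize the independent transfer described above.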