Streaming zero-shot voice conversion (VC) has attracted growing interest for its potential in real-time applications. The recently proposed MeanVC achieves lightweight streaming zero-shot VC, but it has several limitations: its chunk-wise autoregressive denoising doubles the effective training sequence length, its conversion quality degrades under small-chunk settings, and its timbre encoder relies directly on reference mel-spectrograms, making it sensitive to reference audio quality. To address these limitations, we propose MeanVC 2. We introduce future-receptive chunking (FRC), which explicitly schedules past and future receptive fields across the layers of the diffusion transformer decoder and removes clean-chunk teacher forcing. By incorporating bounded future context, FRC enables stable conversion with a 40 ms chunk size. We further introduce a universal timbre token encoder, which constructs a timbre representation from a global speaker embedding and retrieves fine-grained timbre cues via cross-attention, improving robustness to low-quality references and enhancing zero-shot speaker similarity. Experimental results show that MeanVC 2 significantly outperforms MeanVC while reducing latency from 211 ms to 110 ms. Audio samples are publicly available, and the source code will be publicly released.
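To make the idea of scheduling past and future receptive fields per layer more concrete, the following is a minimal sketch of a chunk-level attention mask with bounded lookahead. It is an illustration under stated assumptions, not the MeanVC 2 implementation: the function names `chunk_mask` and `total_lookahead_chunks`, the per-layer `schedule`, and the frame and chunk counts are all hypothetical, and the abstract does not specify the actual schedule.

```python
def chunk_mask(num_frames, chunk_size, past_chunks, future_chunks):
    """Return mask[i][j] == True when query frame i may attend to key frame j.

    A frame in chunk c may see chunks c - past_chunks .. c + future_chunks.
    (Hypothetical helper; the paper's exact masking rule is not given here.)
    """
    mask = []
    for i in range(num_frames):
        query_chunk = i // chunk_size
        row = []
        for j in range(num_frames):
            offset = j // chunk_size - query_chunk  # key chunk minus query chunk
            row.append(-past_chunks <= offset <= future_chunks)
        mask.append(row)
    return mask


def total_lookahead_chunks(schedule):
    """Upper bound on future chunks reachable after stacking all layers.

    Each attention layer can extend the receptive field by at most its own
    future window, so the bound is the sum over layers.
    """
    return sum(future for _, future in schedule)


# Assumed schedule of (past_chunks, future_chunks) per decoder layer:
# only the first two layers are allowed to look one chunk ahead.
schedule = [(4, 1), (4, 1), (4, 0), (4, 0)]

# One mask per layer for a toy sequence of 8 frames in chunks of 2 frames.
masks = [chunk_mask(8, 2, past, future) for past, future in schedule]

# Stacked lookahead is bounded by 1 + 1 + 0 + 0 = 2 chunks.
lookahead = total_lookahead_chunks(schedule)
```

Because the future window is bounded in every layer, the total lookahead stays bounded as well, which is what makes a fixed algorithmic delay possible. Under this assumed schedule and the 40 ms chunk size mentioned above, the lookahead would be 2 chunks, or 80 ms; this figure is purely illustrative and is not a breakdown of the reported 110 ms latency.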