Voice conversion is an increasingly popular technology, and the growing number of real-time applications requires models that can convert speech in a streaming fashion. Unlike typical (non-streaming) voice conversion, which can use the entire utterance as context, streaming voice conversion has no access to future information, which degrades intelligibility, speaker similarity, and sound quality. To address this challenge, we propose DualVC, a dual-mode neural voice conversion approach that supports both streaming and non-streaming modes using separate, jointly trained network parameters. Furthermore, we propose intra-model knowledge distillation and hybrid predictive coding (HPC) to enhance the performance of streaming conversion. Additionally, we incorporate data augmentation to train a noise-robust autoregressive decoder, improving the model's performance on long-form speech conversion. Experimental results show that the proposed model outperforms the baseline models in streaming voice conversion and performs comparably to the non-streaming topline system that uses the complete context, with a latency of only 252.8 ms.