Voice conversion is an increasingly popular technology, and the growing number of real-time applications requires models that can convert speech in a streaming fashion. Unlike typical (non-streaming) voice conversion, which can use the entire utterance as context, streaming voice conversion has no access to future information, which degrades intelligibility, speaker similarity, and sound quality. To address this challenge, we propose DualVC, a dual-mode neural voice conversion approach that supports both streaming and non-streaming modes using jointly trained separate network parameters. Furthermore, we propose intra-model knowledge distillation and hybrid predictive coding (HPC) to enhance the performance of streaming conversion. Additionally, we incorporate data augmentation to train a noise-robust autoregressive decoder, improving the model's performance on long-form speech conversion. Experimental results demonstrate that the proposed model outperforms the baseline models in streaming voice conversion and performs comparably to the non-streaming topline system, which leverages the complete context, while incurring a latency of only 252.8 ms.