Voice Conversion (VC) aims to modify a speaker's timbre while preserving linguistic content. While recent VC models achieve strong performance, most struggle in real-time streaming scenarios due to high latency, dependence on ASR modules, or complex speaker disentanglement, which often results in timbre leakage or degraded naturalness. We present SynthVC, a streaming end-to-end VC framework that directly learns speaker timbre transformation from synthetic parallel data generated by a pre-trained zero-shot VC model. This design eliminates the need for explicit content-speaker separation or recognition modules. Built upon a neural audio codec architecture, SynthVC supports low-latency streaming inference with high output fidelity. Experimental results show that SynthVC outperforms baseline streaming VC systems in both naturalness and speaker similarity, achieving an end-to-end latency of just 77.1 ms.
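The end-to-end latency of a chunk-based streaming system like the one described is typically the sum of input buffering, any future-context (lookahead) window, and per-chunk compute time. The sketch below illustrates this accounting; the individual numbers are hypothetical and chosen only so that they sum to the 77.1 ms figure reported above — the paper's actual breakdown is not given here.

```python
def streaming_latency_ms(chunk_ms: float, lookahead_ms: float, compute_ms: float) -> float:
    """End-to-end latency of a chunk-based streaming pipeline:
    time to buffer one input chunk, plus any lookahead (future context)
    the model requires, plus the compute time to process the chunk.
    """
    return chunk_ms + lookahead_ms + compute_ms

# Hypothetical breakdown (illustrative only) summing to the reported total:
total = streaming_latency_ms(chunk_ms=40.0, lookahead_ms=20.0, compute_ms=17.1)
print(f"end-to-end latency: {total:.1f} ms")
```

Note that compute time must stay below the chunk duration (here, 17.1 ms < 40 ms) for the system to keep up with real-time input; otherwise latency grows without bound.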