Despite rapid progress in the voice style transfer (VST) field, recent zero-shot VST systems still lack the ability to transfer the voice style of a novel speaker. In this paper, we present HierVST, a hierarchical adaptive end-to-end zero-shot VST model. We train the model on speech data alone, without any text transcripts, by utilizing hierarchical variational inference and self-supervised representations. In addition, we adopt a hierarchical adaptive generator that generates the pitch representation and the waveform audio sequentially. Moreover, we utilize unconditional generation to improve the speaker-related acoustic capacity of the acoustic representation. With this hierarchical adaptive structure, the model can adapt to a novel voice style and convert speech progressively. The experimental results demonstrate that our method outperforms other VST models in zero-shot VST scenarios. Audio samples are available at \url{https://hiervst.github.io/}.