We present VoiceShop, a novel speech-to-speech framework that can modify multiple attributes of speech, such as age, gender, accent, and speech style, in a single forward pass while preserving the input speaker's timbre. Previous works have been constrained to specialized models that can edit these attributes only one at a time and that suffer from one or more of the following pitfalls: the conversion effect is weak, there is no zero-shot capability for out-of-distribution speakers, or the synthesized outputs exhibit timbre leakage that changes the speaker's perceived identity. Our work proposes solutions to each of these issues in a simple modular framework based on a conditional diffusion backbone model with optional normalizing flow-based and sequence-to-sequence speaker attribute-editing modules, whose components can be combined or removed during inference to handle a wide array of tasks without additional model finetuning. Audio samples are available at https://voiceshopai.github.io