We present VoiceShop, a novel speech-to-speech framework that can modify multiple attributes of speech, such as age, gender, accent, and speech style, in a single forward pass while preserving the input speaker's timbre. Previous works have been limited to specialized models that can edit these attributes only one at a time and that suffer from one or more of the following pitfalls: the conversion effect is weak, there is no zero-shot capability for out-of-distribution speakers, or the synthesized outputs exhibit undesirable timbre leakage. Our work addresses each of these issues in a simple modular framework built on a conditional diffusion backbone model with optional normalizing flow-based and sequence-to-sequence speaker attribute-editing modules. These components can be combined or removed at inference time to support a wide range of tasks without additional model finetuning. Audio samples are available at \url{https://voiceshopai.github.io}.