In this paper, we propose a differentiable WORLD synthesizer and demonstrate its use in end-to-end audio style transfer tasks such as (singing) voice conversion and the DDSP timbre transfer task. Our baseline differentiable synthesizer has no model parameters, yet it yields adequate synthesis quality. The baseline can be extended by appending lightweight black-box postnets that further process its output to improve fidelity. An alternative differentiable approach extracts the source excitation spectrum directly, which can improve naturalness, albeit for a narrower class of style transfer applications. The acoustic feature parameterization used by our approaches has the added benefit of naturally disentangling pitch and timbral information, so that the two can be modeled separately. Moreover, because these acoustic features can be estimated robustly from monophonic audio sources, parameter loss terms can be added to an end-to-end objective function, which can help convergence and/or further stabilize (adversarial) training.
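As an illustration of how parameter loss terms might enter an end-to-end objective, one possible form is sketched below. The notation is assumed here for illustration and is not taken from the paper: $\hat{y}$ and $y$ denote synthesized and target audio, $f_0$, $S$ and $A$ denote the WORLD features (fundamental frequency, spectral envelope, aperiodicity), a hat marks a model prediction, a star marks a value estimated from the target audio by WORLD analysis, $d_p$ is an unspecified distance, and $\lambda_p$ are hypothetical weights.

```latex
% Sketch only: symbols, distances d_p, and weights \lambda_p are assumptions.
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{audio}}(\hat{y}, y)
  + \sum_{p \in \{f_0,\, S,\, A\}} \lambda_p \, d_p\!\left(\hat{p},\, p^{\ast}\right)
```

Setting every $\lambda_p = 0$ recovers a purely audio-domain objective; nonzero weights add direct supervision on the intermediate acoustic features.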