Voice conversion (VC) converts the voice of a source utterance to that of a target speaker while preserving the source's content. Speech can be decomposed into four main components: content, timbre, rhythm, and pitch. However, most related work considers only content and timbre, which results in less natural speech. Some recent methods can disentangle speech into several components, but they require either laborious bottleneck tuning or multiple hand-crafted features, each assumed to contain disentangled speech information. In this paper, we propose a VC model that automatically disentangles speech into these four components using only two augmentation functions, with no need for multiple hand-crafted features or laborious bottleneck tuning. The proposed model is straightforward yet efficient, and empirical results show that it outperforms the baseline in both disentanglement effectiveness and speech naturalness.