Cycle-consistent generative adversarial networks have been widely used in non-parallel voice conversion (VC). Their ability to learn mappings between source and target features without relying on parallel training data eliminates the need for temporal alignments. However, most methods decouple the conversion of acoustic features from the synthesis of the audio signal, using separate models for conversion and waveform synthesis. This work unifies conversion and synthesis into a single model, thereby eliminating the need for a separate vocoder. By leveraging cycle-consistent training and a self-supervised auxiliary training task, our model efficiently generates high-quality converted raw audio waveforms. Subjective listening tests show that our method outperforms the baseline in whispered speech conversion (up to 6.7% relative improvement), and mean opinion score predictions yield competitive results in conventional VC (between 0.5% and 2.4% relative improvement).
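The abstract does not spell out the form of the cycle-consistency objective, so the following is only a minimal sketch of the idea, assuming the common L1 round-trip formulation used in CycleGAN-style training. The function name `cycle_consistency_loss` and the toy scaling "generators" are hypothetical stand-ins for illustration; in the actual model the mappings would be neural networks operating on raw waveforms.

```python
import numpy as np


def cycle_consistency_loss(x, y, g_xy, g_yx):
    """Sum of mean absolute round-trip errors for both conversion directions.

    x, y  : arrays of source-domain and target-domain samples (non-parallel).
    g_xy  : callable mapping source -> target (hypothetical generator).
    g_yx  : callable mapping target -> source (hypothetical generator).
    """
    x_cycled = g_yx(g_xy(x))  # source -> target -> back to source
    y_cycled = g_xy(g_yx(y))  # target -> source -> back to target
    return float(np.mean(np.abs(x_cycled - x)) + np.mean(np.abs(y_cycled - y)))


# Toy stand-ins for the generators (assumption: simple gain changes).
def to_target(wave):
    return 2.0 * wave


def to_source(wave):
    return 0.5 * wave  # exact inverse of to_target


def lossy_to_source(wave):
    return 0.4 * wave  # imperfect inverse, so the round trip is off


rng = np.random.default_rng(0)
x = rng.standard_normal(16000)  # stand-in for one second of source audio
y = rng.standard_normal(16000)  # unrelated target audio, no alignment needed

perfect = cycle_consistency_loss(x, y, to_target, to_source)
imperfect = cycle_consistency_loss(x, y, to_target, lossy_to_source)
print(perfect, imperfect)
```

Note that `x` and `y` are never compared with each other: each signal is only compared with its own round-trip reconstruction, which is why no parallel data or temporal alignment is required.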