Using unsupervised learning to disentangle speech into content, rhythm, pitch, and timbre for voice conversion has become an active research topic. Existing work generally disentangles speech components through human-crafted bottleneck features, which cannot achieve sufficient information disentanglement: pitch and rhythm may still be mixed together. The resulting information overlap between components reduces speech naturalness. To overcome these limitations, we propose a two-stage model that disentangles speech representations in a self-supervised manner without a human-crafted bottleneck design, using Mutual Information (MI) with a designed upper bound estimator (IFUB) to separate overlapping information between speech components. Moreover, we design a Joint Text-Guided Consistent (TGC) module to guide the extraction of speech content and eliminate timbre leakage. Experiments show that our model outperforms the baseline in disentanglement effectiveness, speech naturalness, and similarity. Audio samples can be found at https://largeaudiomodel.com/eadvc.