Voice conversion (VC) must preserve the content of the source speech while rendering the characteristics of the target speaker. Existing methods do not satisfy both requirements at once, and their outputs exhibit a trade-off between preserving source content and matching target speaker characteristics. In this study, we propose Triple Adaptive Attention Normalization VC (TriAAN-VC), which comprises an encoder-decoder and an attention-based adaptive normalization block and can be applied to non-parallel any-to-any VC. The proposed adaptive normalization block extracts target speaker representations and performs conversion while minimizing the loss of source content using a siamese loss. We evaluated TriAAN-VC on the VCTK dataset in terms of source content preservation and target speaker similarity. Experimental results for one-shot VC suggest that TriAAN-VC achieves state-of-the-art performance while mitigating the trade-off encountered in existing VC methods.
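To make the idea of adaptive normalization concrete, the following is a minimal sketch of plain adaptive instance normalization on toy feature sequences. It is not the TriAAN-VC block, which is attention-based and whose details are not given here. The function names, the channels-by-frames list layout, and the toy values are all assumptions made for illustration.

```python
import math


def channel_stats(features, eps=1e-5):
    """Return per-channel (mean, std) computed over the time axis.

    `features` is a list of channels, each a list of frame values
    (an assumed layout, chosen only for this illustration).
    """
    stats = []
    for channel in features:
        mean = sum(channel) / len(channel)
        var = sum((x - mean) ** 2 for x in channel) / len(channel)
        stats.append((mean, math.sqrt(var + eps)))
    return stats


def adaptive_instance_norm(source, target, eps=1e-5):
    """Re-style source features with the target's per-channel statistics.

    Each source channel is normalized to zero mean and roughly unit
    variance, then scaled and shifted by the target channel's std and mean.
    The frame-to-frame pattern of the source is kept.
    """
    src_stats = channel_stats(source, eps)
    tgt_stats = channel_stats(target, eps)
    converted = []
    for channel, (s_mean, s_std), (t_mean, t_std) in zip(source, src_stats, tgt_stats):
        converted.append([(x - s_mean) / s_std * t_std + t_mean for x in channel])
    return converted


# Toy example: 2 channels; source has 4 frames, target has 3 frames.
source = [[0.0, 1.0, 2.0, 3.0], [10.0, 10.5, 11.0, 11.5]]
target = [[5.0, 7.0, 9.0], [-1.0, 0.0, 1.0]]
converted = adaptive_instance_norm(source, target)
```

In this sketch the converted features take on the target's channel-wise mean and spread while keeping the source's ordering over time, which loosely mirrors the two goals stated above: target speaker similarity and source content preservation.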