Many existing works on voice conversion (VC) use automatic speech recognition (ASR) models to ensure linguistic consistency between source and converted samples. However, in low-resource domains, training a high-quality ASR model remains challenging. In this work, we propose a novel iterative method for improving both the ASR and VC models. We first train an ASR model, which is used to ensure content preservation while training a VC model. In the next iteration, the VC model is used as a data augmentation method to further fine-tune the ASR model and generalize it to diverse speakers. By iteratively using the improved ASR model to train the VC model and vice versa, we experimentally show improvements in both models. In low-resource settings, our proposed framework outperforms the ASR and one-shot VC baseline models on the English singing and Hindi speech domains in both subjective and objective evaluations.
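The alternating procedure described in the abstract can be sketched roughly as the loop below. This is only an illustrative outline, not the authors' implementation: the function names (`iterative_asr_vc`, `train_asr`, `train_vc`, `convert`), the `Utterance` tuple, the default of two iterations, and the choice to reuse the source transcript as the label for each converted utterance are all assumptions made for the example. The toy stand-ins at the bottom train nothing; they only record what they were given so the control flow can be run and inspected.

```python
from typing import Any, Callable, List, Optional, Tuple

# (audio_id, speaker_id, transcript); a hypothetical stand-in for real audio.
Utterance = Tuple[str, str, str]


def iterative_asr_vc(
    labelled: List[Utterance],
    target_speakers: List[str],
    train_asr: Callable[[List[Utterance], Optional[Any]], Any],
    train_vc: Callable[[List[Utterance], Any], Any],
    convert: Callable[[Any, Utterance, str], Utterance],
    num_iterations: int = 2,  # assumed; the abstract does not give a count
) -> Tuple[Any, Any]:
    """Alternate between VC training and ASR fine-tuning."""
    # Step 1: initial ASR model trained on the labelled data alone.
    asr = train_asr(labelled, None)
    vc = None
    for _ in range(num_iterations):
        # Step 2: train VC, with the current ASR used for content preservation.
        vc = train_vc(labelled, asr)
        # Step 3: use VC as data augmentation by re-voicing each utterance
        # as every other target speaker (assumption: transcript is unchanged).
        augmented = [
            convert(vc, utt, spk)
            for utt in labelled
            for spk in target_speakers
            if spk != utt[1]
        ]
        # Step 4: fine-tune the ASR model on original plus converted data.
        asr = train_asr(labelled + augmented, asr)
    return asr, vc


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_train_asr(data, init):
    version = 0 if init is None else init["version"] + 1
    return {"version": version, "num_examples": len(data)}


def toy_train_vc(data, asr):
    return {"asr_version": asr["version"]}


def toy_convert(vc, utt, speaker):
    audio_id, _, transcript = utt
    return (f"{audio_id}->{speaker}", speaker, transcript)


data = [("u1", "spkA", "hello"), ("u2", "spkB", "world")]
asr, vc = iterative_asr_vc(
    data, ["spkA", "spkB", "spkC"], toy_train_asr, toy_train_vc, toy_convert
)
```

Passing the training and conversion routines in as callables keeps the loop independent of any particular ASR or VC architecture, which the abstract does not specify.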