Foreign accent conversion (FAC) is a special application of voice conversion (VC) that aims to convert the accented speech of a non-native speaker into native-sounding speech while preserving the speaker's identity. FAC is difficult because native speech from the desired non-native speaker, which would serve as the training target, is impossible to collect. In this work, we evaluate three recently proposed methods for ground-truth-free FAC, all of which aim to harness the power of sequence-to-sequence (seq2seq) and non-parallel VC models to properly convert the accent and control the speaker identity. Our experimental results show that no single method is significantly better than the others across all evaluation axes, in contrast to the conclusions drawn in previous studies. We also explain the effectiveness of these methods in terms of the training input and output of the seq2seq model, examine the design choice of the non-parallel VC model, and show that intelligibility measures such as word error rate do not correlate well with subjective accentedness. Finally, our implementation is open-sourced to promote reproducible research and to help future researchers improve upon the compared systems.