Research on deep learning-powered voice conversion (VC) in speech-to-speech scenarios is becoming increasingly popular. Although many works in the field share a common global pipeline, the underlying structures, methods, and neural sub-blocks vary considerably across research efforts. As a result, it can be challenging to understand comprehensively why particular methods are chosen at each stage of the voice conversion pipeline, and the actual hurdles in the proposed solutions are often unclear. To shed light on these aspects, this paper presents a scoping review that explores the use of deep learning in speech analysis, synthesis, and disentangled speech representation learning within modern voice conversion systems. We screened 621 publications from more than 38 different venues, published between 2017 and 2023, and then reviewed in depth a final database of 123 eligible studies. Based on the review, we summarise the most frequently used deep learning approaches to voice conversion and highlight common pitfalls within the community. Lastly, we condense the knowledge gathered, identify the main challenges, and provide recommendations for future research directions.