Recent advances in singing voice synthesis (SVS) have attracted substantial attention from both academia and industry. With the advent of large language models and novel generative paradigms, producing controllable, high-fidelity singing voices has become an attainable goal. Yet the field still lacks a comprehensive survey that systematically analyzes deep-learning-based SVS systems and their enabling technologies. To fill this gap, this survey first categorizes existing systems by task type and then organizes current architectures into two major paradigms: cascaded and end-to-end approaches. We then provide an in-depth analysis of the core technologies, covering singing modeling and control techniques. Finally, we review the datasets, annotation tools, and evaluation benchmarks that support training and assessment. The appendix introduces training strategies and offers further discussion of SVS. This survey provides an up-to-date review of the literature on SVS models and is intended as a reference for both researchers and engineers. Related materials are available at https://github.com/David-Pigeon/SyntheticSingers.