Singing Voice Synthesis (SVS) has advanced considerably with the advent of deep learning techniques. However, a major challenge in SVS is the scarcity of labeled singing voice data, which limits the effectiveness of supervised learning methods. To address this challenge, this paper introduces a novel approach that enhances SVS quality by leveraging unlabeled data through pre-trained self-supervised learning models. Building on the existing VISinger2 framework, this study integrates additional spectral feature information into the system to improve its performance. The integration harnesses the rich acoustic representations of the pre-trained models, enriching the synthesis and yielding a more natural and expressive singing voice. Experimental results on multiple corpora demonstrate the efficacy of this approach in improving the overall quality of synthesized singing voices under both objective and subjective metrics.