Singing voice beautifying is a novel task with practical value in daily life. It aims to correct the pitch of a singing voice and improve its expressiveness without changing the original timbre or content. Existing methods either rely on paired data or concentrate only on pitch correction. However, professional and amateur versions of songs sung by the same person are hard to obtain, and singing voice beautifying involves not only pitch correction but also other aspects such as emotion and rhythm. We therefore propose ConTuner, a fast and high-fidelity singing voice beautifying system: a diffusion model that generates the beautified Mel-spectrogram from a modified condition composed of optimized pitch and expressiveness. For pitch correction, we establish a mapping from MIDI and spectrum envelope to pitch. To make amateur singing more expressive, we propose an expressiveness enhancer that operates in the latent space to convert amateur vocal tone to professional vocal tone. ConTuner achieves a satisfactory beautification effect on both Mandarin and English songs. An ablation study demonstrates that the expressiveness enhancer and the generator-based acceleration method in ConTuner are effective.