Singing voice conversion (SVC) is sensitive to noise because it relies on non-robust methods for extracting pitch and energy during inference. Because SVC requires a clean source signal, music source separation is a viable preprocessing step for noisy audio such as singing with background music (BGM). However, current separation methods either fail to remove the noise completely or suppress signal components excessively, which degrades the naturalness and similarity of the processed audio. To address this, we introduce RobustSVC, a novel any-to-one SVC framework that converts noisy vocals into clean vocals sung by the target singer. We replace the non-robust features with a HuBERT-based melody extractor and use adversarial training with three discriminators to reduce information leakage in the self-supervised representations. Experimental results show that RobustSVC is noise-robust and achieves higher similarity and naturalness than baseline methods in both noisy and clean vocal conditions.