Note-level Automatic Singing Voice Transcription (AST) converts singing recordings into note sequences, facilitating the automatic annotation of singing datasets for Singing Voice Synthesis (SVS) applications. Current AST methods, however, struggle with accuracy and robustness when used for practical annotation. This paper presents ROSVOT, the first robust AST model that serves SVS. It incorporates a multi-scale framework that captures coarse-grained note information while ensuring fine-grained frame-level segmentation, coupled with an attention-based pitch decoder for reliable pitch prediction. We also establish a comprehensive annotation-and-training pipeline for SVS to test the model in real-world settings. Experimental results show that ROSVOT achieves state-of-the-art transcription accuracy on both clean and noisy inputs. Moreover, when trained on enlarged, automatically annotated datasets, the SVS model outperforms its baseline, affirming the model's capability for practical application. Audio samples are available at https://rosvot.github.io.