Note-level Automatic Singing Voice Transcription (AST) converts singing recordings into note sequences, facilitating the automatic annotation of singing datasets for Singing Voice Synthesis (SVS) applications. Current AST methods, however, struggle with accuracy and robustness when used for practical annotation. This paper presents ROSVOT, the first robust AST model that serves SVS. It incorporates a multi-scale framework that captures coarse-grained note information while ensuring fine-grained frame-level segmentation, coupled with an attention-based pitch decoder for reliable pitch prediction. We also establish a comprehensive annotation-and-training pipeline for SVS to test the model in real-world settings. Experimental results show that ROSVOT achieves state-of-the-art transcription accuracy on both clean and noisy inputs. Moreover, when trained on enlarged, automatically annotated datasets, the SVS model outperforms its baseline, affirming ROSVOT's capability for practical application. Audio samples are available at https://rosvot.github.io.