Vocal pitch is an important high-level feature in music audio processing. However, extracting vocal pitch from polyphonic music is challenging because of the accompaniment. To eliminate the influence of the accompaniment, most previous methods use a music source separation model to obtain clean vocals from the polyphonic mixture before predicting vocal pitch. As a result, the performance of vocal pitch estimation depends on the performance of the music source separation model. To address this issue and extract vocal pitch directly from polyphonic music, we propose a robust model named RMVPE. The model extracts effective hidden features and accurately predicts vocal pitch from polyphonic music. Experimental results demonstrate the superiority of RMVPE in terms of raw pitch accuracy (RPA) and raw chroma accuracy (RCA). Additionally, experiments conducted with different types of noise show that RMVPE is robust across all signal-to-noise ratio (SNR) levels. The code of RMVPE is available at https://github.com/Dream-High/RMVPE.
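For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how RPA and RCA are conventionally computed. It is not taken from the RMVPE code base. The function name `raw_pitch_and_chroma_accuracy`, the 50-cent tolerance, and the convention that a frequency of 0 Hz marks an unvoiced frame are assumptions made for illustration; the evaluation settings used for RMVPE may differ.

```python
import numpy as np


def raw_pitch_and_chroma_accuracy(ref_hz, est_hz, tol_cents=50.0):
    """Return (RPA, RCA) over the frames that the reference marks as voiced.

    Assumptions (illustrative, not from the RMVPE repository):
    - ref_hz and est_hz are frame-aligned arrays of frequencies in Hz;
    - a value of 0 means "no pitch" in that frame;
    - a voiced reference frame with no estimated pitch counts as an error.
    """
    ref_hz = np.asarray(ref_hz, dtype=float)
    est_hz = np.asarray(est_hz, dtype=float)

    voiced = ref_hz > 0
    n_voiced = int(voiced.sum())
    if n_voiced == 0:
        return 0.0, 0.0

    ref = ref_hz[voiced]
    est = est_hz[voiced]
    has_est = est > 0

    # Signed pitch difference in cents for frames that have an estimate.
    diff = 1200.0 * np.log2(est[has_est] / ref[has_est])

    # RPA: the estimate must lie within the tolerance of the reference.
    rpa = np.sum(np.abs(diff) < tol_cents) / n_voiced

    # RCA: fold the difference to the nearest octave first, so that
    # octave errors (e.g. 440 Hz estimated for 220 Hz) are forgiven.
    folded = diff - 1200.0 * np.round(diff / 1200.0)
    rca = np.sum(np.abs(folded) < tol_cents) / n_voiced

    return float(rpa), float(rca)


# Toy example: one unvoiced frame, then four voiced reference frames.
ref = [0.0, 220.0, 220.0, 440.0, 440.0]
est = [100.0, 220.0, 440.0, 445.0, 0.0]
rpa, rca = raw_pitch_and_chroma_accuracy(ref, est)
print(f"RPA={rpa:.2f}, RCA={rca:.2f}")
```

In the toy example, the octave error in the third frame is counted as wrong by RPA but as correct by RCA, which is the only difference between the two metrics.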