Reliable fundamental frequency (F0) and voicing estimation is essential for neural synthesis, yet many pitch extractors depend on large labeled corpora and degrade under realistic recording artifacts. We propose a lightweight, fully self-supervised framework for joint F0 estimation and voicing inference, designed for rapid single-instrument training from limited audio. Using transposition-equivariant learning on CQT features, we introduce an EM-style iterative reweighting scheme that uses Shift Cross-Entropy (SCE) consistency as a reliability signal to suppress uninformative noisy/unvoiced frames. The resulting weights provide confidence scores that enable pseudo-labeling for a separate lightweight voicing classifier without manual annotations. Trained on MedleyDB and evaluated on MDB-stem-synth ground truth, our method achieves competitive cross-corpus performance (RPA 95.84, RCA 96.24) and demonstrates cross-instrument generalization.
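The EM-style reweighting loop described above can be sketched as follows. This is an illustrative reduction, not the paper's exact algorithm: the function names (`frame_losses_fn`), the exponential loss-to-weight mapping, and the `temperature` parameter are assumptions, standing in for whatever training step and reliability mapping the full method specifies. The core idea is that per-frame SCE consistency losses act as the E-step reliability estimate, and the derived weights down-weight noisy/unvoiced frames in the next training pass.

```python
import numpy as np

def iterative_reweighting(frame_losses_fn, n_frames, n_iters=5, temperature=1.0):
    """EM-style frame reweighting sketch (illustrative assumptions throughout).

    frame_losses_fn(weights) -> array of per-frame SCE consistency losses
    obtained after one training pass using the given frame weights.
    """
    weights = np.ones(n_frames)
    for _ in range(n_iters):
        # E-step proxy: measure SCE consistency per frame under current weights.
        losses = frame_losses_fn(weights)
        # M-step proxy: low SCE loss -> high reliability weight (assumed mapping).
        weights = np.exp(-losses / temperature)
        # Normalize so the average frame weight stays at 1.
        weights /= weights.mean()
    return weights
```

The converged weights can then be thresholded to pseudo-label frames as voiced (high reliability) or unvoiced/noisy (low reliability) when training the separate voicing classifier, with no manual annotations required.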