Multi-pitch estimation is a decades-old research problem: detecting the pitch activity of concurrent musical events within multi-instrument mixtures. Supervised learning techniques have demonstrated solid performance on narrower characterizations of the task, but are limited by the shortage of large-scale, diverse polyphonic music datasets with multi-pitch annotations. We present a suite of self-supervised learning objectives for multi-pitch estimation that encourage the concentration of support around harmonics, invariance to timbral transformations, and equivariance to geometric transformations. These objectives are sufficient to train a fully convolutional autoencoder to produce multi-pitch salience-grams directly, without any fine-tuning. Despite being trained exclusively on a collection of synthetic single-note audio samples, our fully self-supervised framework generalizes to polyphonic music mixtures and achieves performance comparable to supervised models trained on conventional multi-pitch datasets.
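To make two of the stated objectives concrete, the following is a minimal sketch of what an invariance term and an equivariance term could look like on a (frequency bins x frames) representation. It is an illustration under stated assumptions, not the formulation used in this work: the mean-squared-error comparison, the zero-filled frequency shift standing in for a geometric transformation, and the names `shift_bins`, `invariance_loss`, and `equivariance_loss` are all hypothetical.

```python
import numpy as np

def shift_bins(x, k):
    """Translate a (freq_bins, frames) array by k bins along frequency, zero-filling."""
    out = np.zeros_like(x)
    if k > 0:
        out[k:] = x[:-k]
    elif k < 0:
        out[:k] = x[-k:]
    else:
        out[:] = x
    return out

def mse(a, b):
    # Assumed comparison measure; the actual objective may use a different one.
    return float(np.mean((a - b) ** 2))

def invariance_loss(model, x, timbre_fn):
    # A timbral transformation of the input should leave the output unchanged.
    return mse(model(timbre_fn(x)), model(x))

def equivariance_loss(model, x, k):
    # Shifting the input by k bins should shift the output by the same k bins.
    return mse(model(shift_bins(x, k)), shift_bins(model(x), k))
```

Here `model` is any callable mapping an input array to a salience array of the same shape. A model that satisfies both properties drives both terms to zero, so each term can be added to a training loss as a penalty on violations.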