In this paper, we propose a novel self-supervised learning scheme for training rhythm analysis systems and instantiate it for few-shot beat tracking. Taking inspiration from the Contrastive Predictive Coding paradigm, we train a log-mel-spectrogram Transformer encoder to contrast observations at times separated by hypothesized beat intervals with those that are not. We do this without knowledge of the ground-truth tempo or beat positions: we rely instead on the local maxima of a Predominant Local Pulse function, taken as a proxy for tatum positions, to define candidate anchors, candidate positives (located at a distance of a power of two from the anchor) and negatives (the remaining time positions). We show that a model pre-trained with this approach on the unlabeled FMA, MTT and MTG-Jamendo datasets can be successfully fine-tuned in the few-shot regime, i.e., with just a few annotated examples, to achieve competitive beat-tracking performance.
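The anchor/positive/negative selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `local_maxima` and `sample_contrastive_sets`, the `max_power` parameter, and the reading of "distance" as a count of pulse peaks (tatums) on either side of the anchor are all assumptions made for illustration, and a synthetic cosine stands in for a real Predominant Local Pulse curve.

```python
import numpy as np


def local_maxima(curve):
    """Return indices of strict local maxima of a 1-D curve.

    Hypothetical stand-in for peak picking on a pulse curve; the first
    and last samples are never reported as peaks.
    """
    c = np.asarray(curve, dtype=float)
    is_peak = (c[1:-1] > c[:-2]) & (c[1:-1] > c[2:])
    return np.flatnonzero(is_peak) + 1


def sample_contrastive_sets(peaks, anchor_idx, max_power=2):
    """Split candidate peaks into positives and negatives for one anchor.

    Assumption (not stated in the abstract): distance is counted in peaks,
    and positives lie 2**k peaks before or after the anchor for
    k = 0..max_power. All other peaks except the anchor are negatives.
    """
    peaks = np.asarray(peaks)
    offsets = [2 ** k for k in range(max_power + 1)]
    pos_idx = sorted(
        {
            anchor_idx + sign * off
            for off in offsets
            for sign in (-1, 1)
            if 0 <= anchor_idx + sign * off < len(peaks)
        }
    )
    neg_idx = [
        i for i in range(len(peaks)) if i != anchor_idx and i not in pos_idx
    ]
    positives = peaks[np.array(pos_idx, dtype=int)]
    negatives = peaks[np.array(neg_idx, dtype=int)]
    return positives, negatives


# Synthetic pulse curve with one peak every 10 frames (illustrative only).
frames = np.arange(100)
pulse = np.cos(2 * np.pi * frames / 10)
peaks = local_maxima(pulse)
positives, negatives = sample_contrastive_sets(peaks, anchor_idx=4, max_power=2)
```

In a contrastive objective, the encoder embeddings at the `positives` frames would be pulled toward the anchor embedding and those at the `negatives` frames pushed away; how the loss is weighted and how many candidates are drawn per anchor are not specified in the abstract.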