Automatic Chord Recognition (ACR) is constrained by the scarcity of aligned chord labels, since well-aligned annotations are costly to acquire. At the same time, open-weight pre-trained models are currently more accessible than their proprietary training data. In this work, we present a two-stage training pipeline that leverages pre-trained models together with unlabeled audio. In the first stage, a pre-trained BTC model serves as a teacher that generates pseudo-labels for over 1,000 hours of diverse unlabeled audio, and a student model is trained solely on these pseudo-labels. In the second stage, the student is continually trained on ground-truth labels as they become available. To prevent catastrophic forgetting of the representations learned in the first stage, we apply selective knowledge distillation (KD) from the teacher as a regularizer. We evaluate two student architectures, BTC and 2E1D. In stage 1, using only pseudo-labels, the BTC student achieves over 99% of the teacher's performance, while the 2E1D student achieves about 97% across seven standard mir_eval metrics. After a single stage-2 training run for each student, the BTC student surpasses the traditional supervised learning baseline by 2.5% and the original pre-trained teacher by 1.1-3.2% across all metrics. The 2E1D student improves over the traditional supervised learning baseline by 2.67% on average and performs almost on par with the teacher. Both students show large gains on rare chord qualities.
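The two-stage objective described in the abstract can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation. Several details are assumptions made only for illustration: the abstract does not define what makes the distillation "selective", so the sketch applies KD only on frames where the teacher's top prediction agrees with the ground-truth label; the function names, `kd_weight`, `temperature`, and the conventional temperature-squared scaling of the KD term are likewise hypothetical choices.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one frame's chord logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_total for x in scaled]


def pseudo_labels(teacher_logits):
    """Stage 1: one hard pseudo-label per frame (the teacher's argmax class)."""
    return [max(range(len(frame)), key=frame.__getitem__)
            for frame in teacher_logits]


def cross_entropy(student_logits, target):
    """Negative log-likelihood of the target chord class for one frame."""
    return -log_softmax(student_logits)[target]


def kl_divergence(teacher_logits, student_logits, temperature):
    """KL(teacher || student) between temperature-softened distributions."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def stage2_loss(student_logits, teacher_logits, labels,
                kd_weight=0.5, temperature=2.0):
    """Stage 2: supervised loss plus a KD regularizer on selected frames.

    ASSUMPTION: a frame is "selected" when the teacher's argmax matches the
    ground-truth label. The source does not specify its selection rule.
    """
    n = len(labels)
    ce = sum(cross_entropy(s, y)
             for s, y in zip(student_logits, labels)) / n

    teacher_pred = pseudo_labels(teacher_logits)
    selected = [i for i in range(n) if teacher_pred[i] == labels[i]]
    if not selected:
        return ce  # no trustworthy teacher frames: plain supervised loss

    kd = sum(kl_divergence(teacher_logits[i], student_logits[i], temperature)
             for i in selected) / len(selected)
    # Temperature-squared scaling is a common KD convention (assumed here).
    return ce + kd_weight * temperature ** 2 * kd


# Toy example: two frames, three chord classes.
teacher = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
student = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
labels = [0, 2]  # the teacher is right on frame 0 and wrong on frame 1

stage1_targets = pseudo_labels(teacher)          # targets for stage-1 training
loss = stage2_loss(student, teacher, labels)     # stage-2 training objective
```

In this toy case only frame 0 contributes to the KD term, because the teacher disagrees with the ground truth on frame 1. Under the assumed rule, the student is pulled toward the teacher where the teacher is reliable and follows the ground-truth labels elsewhere.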