Self-supervised learning (SSL) on large-scale datasets such as AudioSet has become the dominant paradigm for audio representation learning. The continuous influx of new, unlabeled audio offers an opportunity to enrich these static representations, but the naive approach of retraining the model from scratch on all available data is computationally prohibitive and discards the valuable knowledge embedded in the previously trained model weights. To address this inefficiency, we propose SONAR (Self-distilled cONtinual pre-training for domain adaptive Audio Representation), a continual pre-training framework built upon BEATs. SONAR adapts effectively to new domains while mitigating catastrophic forgetting by tackling three key challenges: jointly sampling new and prior data, applying regularization to balance specificity and generality, and dynamically expanding the tokenizer codebook to capture novel acoustic patterns. Experiments across four distinct domains demonstrate that our method achieves both high adaptability and robust resistance to forgetting.