Bootstrap-based Self-Supervised Learning (SSL) has achieved remarkable progress in audio understanding. However, existing methods typically operate at a single level of granularity, limiting their ability to model the diverse temporal and spectral structures inherent in complex audio signals. Furthermore, bootstrapping representations from scratch is computationally expensive, often requiring extensive training to converge. In this work, we propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges. First, to capture hierarchical audio features, CAT incorporates a Multi-resolution Block that aggregates information across varying granularities. Second, to improve training efficiency, we introduce a Representation Regularization objective. Drawing inspiration from generative modeling, this auxiliary task guides the student model by aligning its predictions with high-quality semantic representations from frozen, pre-trained external encoders. Experimental results demonstrate that CAT significantly outperforms baselines on audio understanding benchmarks. Notably, it achieves competitive performance on the AudioSet 20k dataset while converging 5 times faster than existing methods. Code and checkpoints will be released soon at https://github.com/realzhouchushu/CAT.
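The Representation Regularization objective described above can be illustrated with a minimal sketch. The snippet below shows one common way such an auxiliary alignment loss is formulated in bootstrap-style SSL: a negative cosine similarity between the student's predictions and the (detached) embeddings of a frozen external teacher. The function name and the choice of cosine similarity are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def representation_regularization_loss(student_pred: torch.Tensor,
                                       teacher_repr: torch.Tensor) -> torch.Tensor:
    """Auxiliary alignment loss (illustrative sketch, not the paper's exact form).

    Aligns student predictions with semantic representations from a frozen,
    pre-trained external encoder via negative cosine similarity.
    """
    s = F.normalize(student_pred, dim=-1)
    # detach(): the external teacher is frozen, so no gradients flow into it
    t = F.normalize(teacher_repr.detach(), dim=-1)
    # 1 - cos(s, t), averaged over the batch; 0 when perfectly aligned
    return (1.0 - (s * t).sum(dim=-1)).mean()
```

In practice this term would be added, with a weighting coefficient, to the primary bootstrap objective, so the student receives both self-distillation and external semantic guidance.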