Probing is widely adopted in computer vision to faithfully evaluate self-supervised learning (SSL) embeddings, as fine-tuning may misrepresent their inherent quality. In contrast, audio SSL models still rely on fine-tuning because simple probing fails to unlock their full potential and even alters model rankings when competing for SOTA on AudioSet. Hence, a robust and efficient probing mechanism is required to guide the trajectory of audio SSL towards reliable and reproducible methods. We introduce Convex Gated Probing (CGP), a prototype-based method that substantially closes the gap between fine-tuning and probing in audio. CGP efficiently utilizes all frozen layers via a gating mechanism and exposes where task-relevant information resides in the network. Guided by CGP, we rework the entire SSL pipeline of current SOTA audio models, which rely on legacy implementations of prior SSL methods. By refining data preprocessing, model architecture, and the pre-training recipe, we introduce the Better Audio Transformer (BAT) and establish new SOTA results on audio benchmarks.
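To make the idea concrete, the following is a minimal sketch of what a convex gated probe over frozen layers could look like. All names, shapes, and design details here are assumptions for illustration (the abstract does not specify the implementation): a learnable gate is softmax-normalized into convex weights that pool embeddings from every frozen layer, and cosine similarity to learnable class prototypes serves as the prototype-based head.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class ConvexGatedProbe:
    """Illustrative sketch, not the paper's implementation.

    A convex gate (softmax over learnable logits) pools embeddings from
    all frozen layers; class logits come from cosine similarity to
    learnable class prototypes. Inspecting the gate weights after
    training would reveal which layers carry task-relevant information.
    """

    def __init__(self, num_layers, dim, num_classes, seed=0):
        rng = np.random.default_rng(seed)
        # Learnable gate logits; softmax makes the layer weights convex
        # (non-negative, summing to 1).
        self.gate_logits = np.zeros(num_layers)
        # Learnable class prototypes (hypothetical parameterization).
        self.prototypes = rng.standard_normal((num_classes, dim))

    def forward(self, layer_embs):
        # layer_embs: (num_layers, batch, dim) frozen features,
        # one embedding per transformer layer.
        w = softmax(self.gate_logits)                 # convex weights
        pooled = np.tensordot(w, layer_embs, axes=1)  # (batch, dim)
        # Cosine similarity to prototypes as class logits.
        pooled = pooled / np.linalg.norm(pooled, axis=-1, keepdims=True)
        protos = self.prototypes / np.linalg.norm(
            self.prototypes, axis=-1, keepdims=True)
        return pooled @ protos.T                      # (batch, num_classes)

# Usage: 12 frozen layers, 768-dim embeddings, 10 classes.
probe = ConvexGatedProbe(num_layers=12, dim=768, num_classes=10)
feats = np.random.default_rng(1).standard_normal((12, 4, 768))
logits = probe.forward(feats)
print(logits.shape)  # (4, 10)
```

In practice the gate logits and prototypes would be trained with a standard classification loss while the backbone stays frozen; because the pooled embedding is a convex combination, the learned weights directly localize where in the network the task-relevant information lives.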