Probing is widely adopted in computer vision to faithfully evaluate self-supervised learning (SSL) embeddings, as finetuning may misrepresent their inherent quality. In contrast, audio SSL models still rely on finetuning, because simple probing fails to unlock their full potential and changes their relative rankings on AudioSet. A robust and efficient probing mechanism is therefore required to steer audio SSL toward reliable and reproducible methods. We introduce Convex Gated Probing (CGP), a prototype-based method that significantly narrows the gap between finetuning and probing in audio. CGP efficiently utilizes all frozen layers via a gating mechanism and exposes where latent task-relevant information is located. Using CGP as a reliable post-hoc evaluation probe, we rework the entire SSL pipeline of the current best-performing audio models, which rely on legacy implementations of prior SSL methods. By refining the data preprocessing, model architecture, and pretraining recipe, we introduce the Better Audio Transformer (BAT) and establish a new state of the art (SOTA) on audio benchmarks.
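To make the gating idea concrete, the following is a minimal sketch of one plausible reading of a convex gated, prototype-based probe. The abstract does not specify the formulation, so everything here is an assumption for illustration: the softmax parameterization of the gate, the negative squared Euclidean distance to class prototypes, and the hypothetical names `convex_gate` and `prototype_logits`. It should not be taken as the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def convex_gate(layer_embeddings, gate_logits):
    """Mix one embedding per frozen layer using convex weights.

    Assumption: the gate is a softmax over learnable logits, so the
    weights are non-negative and sum to 1 (a convex combination).
    """
    weights = softmax(gate_logits)
    dim = len(layer_embeddings[0])
    mixed = [
        sum(w * emb[d] for w, emb in zip(weights, layer_embeddings))
        for d in range(dim)
    ]
    return mixed, weights


def prototype_logits(embedding, prototypes):
    """Score an embedding against each class prototype.

    Assumption: the score is the negative squared Euclidean distance,
    so the nearest prototype receives the highest logit.
    """
    return [
        -sum((e - p) ** 2 for e, p in zip(embedding, proto))
        for proto in prototypes
    ]


# Toy usage: three frozen layers, 2-dimensional embeddings, two classes.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed, weights = convex_gate(layers, gate_logits=[0.0, 0.0, 0.0])
scores = prototype_logits(mixed, prototypes=[[1.0, 1.0], [-1.0, -1.0]])
predicted_class = scores.index(max(scores))
```

Under this reading, only the gate logits and the prototypes would be trained while the backbone stays frozen, and the learned gate weights could be inspected after training as a rough indication of which layers carry task-relevant information.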