Transformers, originally developed for natural language processing, have recently attracted significant interest in the computer vision and audio communities because of their flexibility in learning long-range relationships. Because transformers are data-hungry and labeled audio data is limited, most transformer-based models for audio tasks are finetuned from ImageNet-pretrained models, despite the large gap between the domains of natural images and audio. This has motivated research on self-supervised pretraining of audio transformers, which reduces the dependency on large amounts of labeled data and focuses on extracting concise representations of audio spectrograms. In this paper, we propose \textbf{L}ocal-\textbf{G}lobal \textbf{A}udio \textbf{S}pectrogram v\textbf{I}sion \textbf{T}ransformer (ASiT), a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learning and self-distillation. We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification. We further conduct comprehensive ablation studies, including evaluations of different pretraining strategies. The proposed ASiT framework significantly improves performance on all tasks and sets a new state of the art on five audio and speech classification tasks, outperforming recent methods, including approaches that use additional datasets for pretraining.
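The abstract names two ingredients, group masked model learning and self-distillation, without giving details. The sketch below is only an illustration under stated assumptions: it masks rectangular groups of neighbouring patches on a spectrogram patch grid, and it updates a teacher as an exponential moving average (EMA) of a student, as is common in self-distillation. The function names, block sizes, mask ratio, and momentum are hypothetical and are not taken from the paper.

```python
import numpy as np


def group_mask(grid_h, grid_w, mask_ratio=0.5, max_block=4, rng=None):
    """Return a boolean (grid_h, grid_w) mask; True marks a masked patch.

    Rectangular groups of neighbouring patches are masked until at least
    mask_ratio of the grid is covered. Block sizes are an assumption.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must lie in [0, 1]")
    rng = np.random.default_rng() if rng is None else rng
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    target = int(np.ceil(mask_ratio * grid_h * grid_w))
    while mask.sum() < target:
        # Sample a block height/width, then a position where it fits.
        bh = rng.integers(1, min(max_block, grid_h) + 1)
        bw = rng.integers(1, min(max_block, grid_w) + 1)
        top = rng.integers(0, grid_h - bh + 1)
        left = rng.integers(0, grid_w - bw + 1)
        mask[top:top + bh, left:left + bw] = True
    return mask


def ema_update(teacher, student, momentum=0.996):
    """Move each teacher parameter towards the student, in place.

    Both arguments are dicts mapping parameter names to NumPy arrays.
    """
    for name in teacher:
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * student[name]


# Usage: mask a hypothetical 8 x 64 (frequency x time) patch grid.
mask = group_mask(8, 64, mask_ratio=0.5, rng=np.random.default_rng(0))
print(mask.shape, mask.mean())
```

Masking contiguous groups rather than isolated patches makes reconstruction harder to solve by copying from immediate neighbours, which is the usual motivation for this kind of masking; whether ASiT uses exactly this scheme is not stated in the abstract.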