We present the Multiscale Audio Spectrogram Transformer (MAST) for audio classification, which brings the concept of multiscale feature hierarchies to the Audio Spectrogram Transformer (AST). Given an input audio spectrogram, we first patchify it and project it to an initial temporal resolution and embedding dimension, after which the successive stages of MAST progressively expand the embedding dimension while reducing the temporal resolution. This pyramid structure lets the early layers of MAST, which operate at high temporal resolution with a low-dimensional embedding, model simple low-level acoustic information, while the deeper, temporally coarse layers model high-level acoustic information with high-dimensional embeddings. We also extend our approach to a new Self-Supervised Learning (SSL) method called SS-MAST, which computes a symmetric contrastive loss between latent representations from a student and a teacher encoder and leverages patch-drop, a novel audio augmentation approach that we introduce. In practice, MAST significantly outperforms AST by an average of 3.4% in accuracy across 8 speech and non-speech tasks from the LAPE Benchmark and achieves state-of-the-art results on keyword spotting in Speech Commands. Additionally, our proposed SS-MAST achieves an absolute average improvement of 2.6% over the previously proposed SSAST.
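The pyramid described above can be illustrated with a small shape-tracking sketch. It is not the MAST implementation: the number of stages, the halving of temporal resolution, the doubling of embedding dimension, the starting shape, and the names `pyramid_schedule` and `pool_time` are all assumptions chosen for illustration, and plain average pooling stands in for whatever downsampling operator the model actually uses.

```python
from dataclasses import dataclass


@dataclass
class StageShape:
    stage: int
    time_steps: int
    embed_dim: int


def pyramid_schedule(time_steps, embed_dim, num_stages, time_stride=2, dim_mult=2):
    """Return the (time, dim) shape at each stage of a hypothetical pyramid.

    Assumption: every stage after the first divides the temporal resolution
    by `time_stride` and multiplies the embedding dimension by `dim_mult`.
    """
    shapes = [StageShape(0, time_steps, embed_dim)]
    for stage in range(1, num_stages):
        time_steps = max(1, time_steps // time_stride)
        embed_dim = embed_dim * dim_mult
        shapes.append(StageShape(stage, time_steps, embed_dim))
    return shapes


def pool_time(tokens, stride):
    """Average non-overlapping windows of token vectors along the time axis.

    A stand-in for the temporal downsampling between stages; trailing tokens
    that do not fill a full window are dropped.
    """
    pooled = []
    for start in range(0, len(tokens) - stride + 1, stride):
        window = tokens[start:start + stride]
        pooled.append([sum(column) / stride for column in zip(*window)])
    return pooled


# Hypothetical starting point: 64 time steps, 96-dimensional embeddings.
shapes = pyramid_schedule(time_steps=64, embed_dim=96, num_stages=4)
for s in shapes:
    print(f"stage {s.stage}: time={s.time_steps:3d} dim={s.embed_dim}")

# Four 2-dimensional tokens pooled with stride 2 leave two tokens.
print(pool_time([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]], stride=2))
```

Under these assumed factors, the sequence length shrinks as the per-token embedding grows, which is the trade-off the abstract attributes to the early and deep layers.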