AaSP: Aliasing-aware Self-Supervised Pre-Training for Audio Spectrogram Transformers

Transformer-based audio self-supervised learning (SSL) models commonly use spectrograms, vision-style Transformers, and masked modeling objectives. However, convolutional patchification with temporal downsampling lowers the effective Nyquist frequency and introduces aliasing, while naïve low-pass filtering may remove task-relevant high-frequency cues. We present AaSP, an aliasing-aware self-supervised pre-training framework for audio spectrogram transformers. AaSP combines an aliasing-aware patch representation, teacher-student masked modeling, a cross-attention predictor, and multi-mask contrastive regularization to learn representations that integrate features from alias-prone modulation bands while remaining stable across masked views. Its patch-embedding module, Aliasing-aware Patch Embedding (AaPE), augments standard patch tokens with features from alias-prone modulation bands using a band-limited complex sinusoidal kernel with a two-sided exponential window. The kernel's frequency and decay parameters are estimated from the input, enabling adaptive subband analysis whose outputs are fused with standard patch tokens. We pre-train on AudioSet and evaluate the learned representations by fine-tuning and linear evaluation on acoustic/environmental, speech, and music recognition benchmarks. Under fine-tuning, the full AaSP framework achieves state-of-the-art results on AS-20K, ESC-50, and NSynth among compared self-supervised baselines, while remaining competitive elsewhere. Linear evaluation shows a similar trend, including gains on US8K and NSynth. Overall, AaSP learns representations that are more stable under aliasing-sensitive temporal perturbations and competitive for downstream transfer.
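To make the two signal-processing ideas in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the kernel takes the form k[n] = exp(-decay * |n|) * exp(j * 2π * freq * n), which is one plausible reading of "a complex sinusoidal kernel with a two-sided exponential window"; the exact parameterization and band limiting used in AaPE are not specified here. The function names, the stride of 4, and the example frequency of 0.3 cycles per frame are hypothetical choices made only for illustration.

```python
import numpy as np


def exp_windowed_complex_kernel(freq, decay, half_width):
    """Complex sinusoid under a two-sided exponential window (assumed form).

    freq is in cycles per sample, decay is the window's exponential rate,
    and the kernel spans n = -half_width .. half_width.
    """
    n = np.arange(-half_width, half_width + 1)
    window = np.exp(-decay * np.abs(n))      # two-sided exponential envelope
    carrier = np.exp(2j * np.pi * freq * n)  # complex sinusoid at `freq`
    return window * carrier


def aliased_frequency(freq, stride):
    """Apparent frequency (cycles per original sample) of a real sinusoid
    after keeping only every `stride`-th sample, with no anti-alias filter."""
    fs_new = 1.0 / stride                    # new sampling rate
    folded = freq % fs_new
    return min(folded, fs_new - folded)      # fold into [0, fs_new / 2]


# A temporal stride of 4 lowers the Nyquist limit from 0.5 to 0.125
# cycles per frame, so a modulation at 0.3 folds down to 0.05.
alias = aliased_frequency(0.3, stride=4)

# The kernel acts as a band-pass analyzer: its magnitude response
# peaks near the chosen centre frequency.
kernel = exp_windowed_complex_kernel(freq=0.3, decay=0.1, half_width=32)
spectrum = np.abs(np.fft.fft(kernel, n=1024))
peak_freq = np.argmax(spectrum) / 1024
```

In this sketch, a larger `decay` shortens the window and widens the analyzed band, while a smaller `decay` narrows it. That is the trade-off an input-dependent estimate of the frequency and decay parameters would be adjusting.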
