Since the introduction of Masked Autoencoders, various improvements to masking techniques have been explored. In this paper, we rethink masking strategies for audio representation learning using masked prediction-based self-supervised learning (SSL) on general audio spectrograms. While recent informed masking techniques have attracted attention, we observe that they incur substantial computational overhead. Motivated by this observation, we propose dispersion-weighted masking (DWM), a lightweight masking strategy that leverages the spectral sparsity inherent in the frequency structure of audio content. Our experiments show that inverse block masking, commonly used in recent SSL frameworks, improves audio event understanding performance while introducing a trade-off in generalization. The proposed DWM alleviates this trade-off and reduces computational complexity, leading to consistent performance improvements. This work provides practical guidance on masking strategy design for masked prediction-based audio representation learning.