In this work, we propose a Multi-Window Masked Autoencoder (MW-MAE) fitted with a novel Multi-Window Multi-Head Attention (MW-MHA) module that facilitates the modelling of local-global interactions in every decoder transformer block through attention heads of several distinct local and global windows. Empirical results on ten downstream audio tasks show that MW-MAEs consistently outperform standard MAEs in overall performance and learn better general-purpose audio representations, while also demonstrating considerably better scaling characteristics. Investigating attention distances and entropies reveals that MW-MAE encoders learn heads with broader local and global attention. Analyzing attention head feature representations through Projection Weighted Canonical Correlation Analysis (PWCCA) shows that attention heads with the same window sizes across the decoder layers of the MW-MAE learn correlated feature representations, which enables each block to independently capture local and global information, leading to a decoupled decoder feature hierarchy. Code for feature extraction and downstream experiments, along with pre-trained models, will be released publicly.
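To make the idea of per-head windows concrete, the following is a minimal sketch, not the released implementation. It assumes that a "window" restricts each head to non-overlapping blocks of positions along a 1-D patch sequence, and that a window at least as long as the sequence acts as a global head; the abstract does not specify either detail. All function names (`window_mask`, `multi_window_attention_weights`, `mean_attention_distance`, `mean_attention_entropy`) are hypothetical, and the distance and entropy definitions are common choices rather than the ones confirmed by the source.

```python
import math


def window_mask(seq_len, window):
    """Return a boolean mask where position i may attend to position j only if
    both lie in the same non-overlapping block of `window` positions.
    Assumption: window >= seq_len yields a fully global head."""
    return [[(i // window) == (j // window) for j in range(seq_len)]
            for i in range(seq_len)]


def masked_softmax(scores, mask_row):
    """Softmax over the allowed positions only; masked positions get weight 0."""
    peak = max(s for s, m in zip(scores, mask_row) if m)  # numerical stability
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]


def multi_window_attention_weights(scores_per_head, windows):
    """scores_per_head[h][i][j] holds the raw attention logit of head h from
    query i to key j; windows[h] is the window size assigned to head h."""
    weights = []
    for head_scores, window in zip(scores_per_head, windows):
        seq_len = len(head_scores)
        mask = window_mask(seq_len, window)
        weights.append([masked_softmax(head_scores[i], mask[i])
                        for i in range(seq_len)])
    return weights


def mean_attention_distance(head_weights):
    """Attention-weighted |i - j|, averaged over queries (in positions)."""
    total = sum(w * abs(i - j)
                for i, row in enumerate(head_weights)
                for j, w in enumerate(row))
    return total / len(head_weights)


def mean_attention_entropy(head_weights):
    """Shannon entropy (nats) of each query's attention row, averaged."""
    total = sum(-w * math.log(w)
                for row in head_weights for w in row if w > 0.0)
    return total / len(head_weights)


if __name__ == "__main__":
    seq_len, windows = 8, [2, 4, 8]  # two local heads and one global head
    # Uniform logits, so differences come from the windows alone.
    logits = [[[0.0] * seq_len for _ in range(seq_len)] for _ in windows]
    for window, head in zip(windows,
                            multi_window_attention_weights(logits, windows)):
        print(window, mean_attention_distance(head),
              mean_attention_entropy(head))
```

With identical logits in every head, wider windows produce larger mean attention distances and entropies. This gives a rough intuition for how such statistics can separate local from global heads, under the block-window assumption stated above.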