In this work, we re-examine inter-patch dependencies in the decoding mechanism of masked autoencoders (MAE). We decompose this decoding mechanism for masked patch reconstruction into self-attention and cross-attention. Our investigations suggest that self-attention between masked patches is not essential for learning good representations. Motivated by this finding, we propose a novel pretraining framework: Cross-Attention Masked Autoencoders (CrossMAE). CrossMAE's decoder uses only cross-attention between masked and visible tokens, with no degradation in downstream performance. This design also enables decoding only a small subset of mask tokens, which improves efficiency. Furthermore, each decoder block can now draw on different encoder features, resulting in improved representation learning. CrossMAE matches MAE in performance with $2.5$ to $3.7\times$ less decoding compute. It also surpasses MAE on ImageNet classification and COCO instance segmentation under the same compute. Code and models: https://crossmae.github.io
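To illustrate why a cross-attention-only decoder permits decoding just a subset of mask tokens, the following is a minimal sketch, not the released CrossMAE code. It assumes a single attention head with no learned projections, random vectors standing in for encoder features and mask-token queries, and plain Python lists in place of a tensor library. The names `cross_attention`, `visible`, `mask_queries`, and `subset_ids` are hypothetical and chosen for illustration only.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head cross-attention without learned projections (a simplification).

    queries: list of mask-token vectors.
    context: list of visible-token vectors, used as both keys and values.
    Each query attends only to the context tokens, never to other queries.
    """
    dim = len(context[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in context]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)]
        )
    return outputs


# Toy sizes, assumed for illustration; they are not the paper's settings.
rng = random.Random(0)
dim, num_visible, num_masked = 8, 4, 12
visible = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_visible)]
mask_queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_masked)]

# Decode every mask token, then decode only a small subset of them.
full = cross_attention(mask_queries, visible)
subset_ids = [1, 5, 9]
partial = cross_attention([mask_queries[i] for i in subset_ids], visible)
```

Because no query attends to another query, the rows of `partial` coincide with the rows of `full` at `subset_ids`: dropping mask tokens leaves the remaining outputs unchanged. With self-attention among mask tokens this would not hold, since each output would depend on which other mask tokens are present.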