MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers

In this paper, we propose Mixed and Masked AutoEncoder (MixMAE), a simple but efficient pretraining method that is applicable to various hierarchical Vision Transformers. Existing masked image modeling (MIM) methods for hierarchical Vision Transformers replace a random subset of input tokens with a special [MASK] symbol and aim to reconstruct the original image tokens from the corrupted image. However, we find that, because of the large masking ratio (e.g., 60% in SimMIM), using the [MASK] symbol greatly slows down training and causes an inconsistency between pretraining and finetuning. MAE, on the other hand, introduces no [MASK] tokens at its encoder, but it is not applicable to hierarchical Vision Transformers. To address these issues and accelerate the pretraining of hierarchical models, we replace the masked tokens of one image with the visible tokens of another image, i.e., we create a mixed image. We then conduct dual reconstruction, recovering both original images from the mixed input, which significantly improves efficiency. While MixMAE can be applied to various hierarchical Transformers, this paper explores Swin Transformer with a large window size and scales the model up to 600M parameters. Empirical results demonstrate that MixMAE learns high-quality visual representations efficiently. Notably, MixMAE with Swin-B/W14 achieves 85.1% top-1 accuracy on ImageNet-1K after 600 epochs of pretraining. In addition, its transfer performance on 6 other datasets shows that MixMAE has a better FLOPs/performance tradeoff than previous popular MIM methods. Code is available at https://github.com/Sense-X/MixMIM.
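The mixing and dual-reconstruction idea can be illustrated with a minimal, standard-library sketch. It is not the released implementation: the function names (`make_mask`, `mix_tokens`, `dual_reconstruction_loss`), the 50% mixing ratio, the mean-squared-error objective, and the choice to score each image only at the positions where its tokens were replaced are all assumptions made for illustration. No encoder or decoder is modeled; the predictions are simply passed in.

```python
import random


def make_mask(num_tokens, mask_ratio, rng):
    """Return a 0/1 list: 1 keeps image A's token, 0 swaps in image B's token."""
    num_from_b = int(num_tokens * mask_ratio)
    mask = [0] * num_from_b + [1] * (num_tokens - num_from_b)
    rng.shuffle(mask)
    return mask


def mix_tokens(tokens_a, tokens_b, mask):
    """Build the mixed input: no [MASK] symbol, every position holds a real token."""
    return [a if m == 1 else b for a, b, m in zip(tokens_a, tokens_b, mask)]


def dual_reconstruction_loss(pred_a, pred_b, tokens_a, tokens_b, mask):
    """Average per-token MSE over the hidden positions of each image (assumed objective)."""

    def mse(pred, target):
        return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

    # Image A is hidden where mask == 0; image B is hidden where mask == 1.
    terms_a = [mse(p, t) for p, t, m in zip(pred_a, tokens_a, mask) if m == 0]
    terms_b = [mse(p, t) for p, t, m in zip(pred_b, tokens_b, mask) if m == 1]
    terms = terms_a + terms_b
    return sum(terms) / len(terms)


# Toy data: two "images" of 16 tokens, each token a 4-dimensional vector.
rng = random.Random(0)
num_tokens, dim = 16, 4
tokens_a = [[rng.random() for _ in range(dim)] for _ in range(num_tokens)]
tokens_b = [[rng.random() for _ in range(dim)] for _ in range(num_tokens)]

mask = make_mask(num_tokens, 0.5, rng)
mixed = mix_tokens(tokens_a, tokens_b, mask)

# A perfect reconstruction of both images yields zero loss.
loss = dual_reconstruction_loss(tokens_a, tokens_b, tokens_a, tokens_b, mask)
```

Because every position of `mixed` carries a visible token from one of the two images, the encoder in such a scheme never sees a placeholder symbol, which is the property the abstract contrasts with [MASK]-based methods.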
