Generative modeling and representation learning are two key tasks in computer vision. However, models for the two tasks are typically trained independently, which ignores the potential for each task to help the other and adds training and model-maintenance overhead. In this work, we propose MAsked Generative Encoder (MAGE), the first framework to unify state-of-the-art image generation and self-supervised representation learning. Our key insight is that using variable masking ratios in masked image modeling pre-training allows generative training (very high masking ratio) and representation learning (lower masking ratio) under the same training framework. Inspired by previous generative models, MAGE uses semantic tokens learned by a vector-quantized GAN as inputs and outputs, and combines this with masking. The representation can be further improved by adding a contrastive loss to the encoder output. We extensively evaluate the generation and representation learning capabilities of MAGE. On ImageNet-1K, a single MAGE ViT-L model obtains 9.10 FID on class-unconditional image generation and 78.9% top-1 accuracy for linear probing, achieving state-of-the-art performance in both image generation and representation learning. Code is available at https://github.com/LTH14/mage.
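The variable-masking-ratio idea can be illustrated with a minimal Python sketch. It is not the released MAGE code: the truncated-Gaussian sampler, its parameters (mean 0.55, standard deviation 0.25, range [0.5, 1.0]), the 256-token sequence length, the 1024-entry codebook, and the names `sample_mask_ratio` and `mask_tokens` are all assumptions chosen for illustration, and random integers stand in for the semantic tokens a vector-quantized GAN tokenizer would produce.

```python
import math
import random


def sample_mask_ratio(rng, mean=0.55, std=0.25, low=0.5, high=1.0):
    """Draw one masking ratio from a Gaussian truncated to [low, high].

    Rejection sampling: redraw until the value falls inside the range.
    All default parameters are illustrative assumptions.
    """
    while True:
        ratio = rng.gauss(mean, std)
        if low <= ratio <= high:
            return ratio


def mask_tokens(tokens, ratio, mask_id, rng):
    """Replace a random subset of tokens with mask_id.

    Returns the masked sequence and the set of masked positions.
    The number of masked tokens is ceil(ratio * len(tokens)), at least 1.
    """
    n = len(tokens)
    n_mask = min(n, max(1, math.ceil(ratio * n)))
    masked_positions = set(rng.sample(range(n), n_mask))
    masked = [mask_id if i in masked_positions else t
              for i, t in enumerate(tokens)]
    return masked, masked_positions


rng = random.Random(0)
codebook_size = 1024   # assumed tokenizer vocabulary size
num_tokens = 256       # assumed 16 x 16 token grid per image
mask_id = codebook_size  # hypothetical extra id reserved for [MASK]

# Stand-in for tokenizer output: one random token id per image patch.
tokens = [rng.randrange(codebook_size) for _ in range(num_tokens)]

# A fresh ratio is drawn for every training sample, so one model sees
# both moderately masked and almost fully masked inputs.
ratio = sample_mask_ratio(rng)
masked, masked_positions = mask_tokens(tokens, ratio, mask_id, rng)
```

In this sketch, draws near 1.0 leave almost no visible tokens, which resembles the generative setting, while draws near the lower bound leave more context, which resembles the representation learning setting. A training loop would ask the model to predict the original ids at `masked_positions`.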