We introduce Corrupted Image Modeling (CIM) for self-supervised visual pre-training. Instead of using artificial [MASK] tokens, CIM corrupts the input image with an auxiliary generator built on a small trainable BEiT: some patches are randomly selected and replaced with plausible alternatives sampled from the BEiT output distribution. Given this corrupted image, an enhancer network learns either to recover all the original image pixels or to predict whether each visual token has been replaced by a generator sample. The generator and the enhancer are trained simultaneously and updated synergistically. After pre-training, the enhancer can be used as a high-capacity visual encoder for downstream tasks. CIM is a general and flexible visual pre-training framework that suits various network architectures. For the first time, CIM demonstrates that both ViT and CNN can learn rich visual representations using a unified, non-Siamese framework. Experiments show that our approach achieves compelling results on vision benchmarks such as ImageNet classification and ADE20K semantic segmentation.
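The corruption step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the image is already represented as a sequence of discrete visual token ids, and it uses random logits as a stand-in for the BEiT generator output. The function name `corrupt_tokens`, the 50% selection ratio, the 196-token sequence, and the 8192-entry vocabulary are all hypothetical choices for illustration. The sketch also assumes that a sampled token identical to the original counts as "not replaced", and it omits mapping the corrupted tokens back to pixel space.

```python
import numpy as np


def corrupt_tokens(tokens, generator_logits, mask_ratio, rng):
    """Replace a random subset of visual tokens with generator samples.

    tokens:           (N,) integer array of original visual token ids
    generator_logits: (N, V) array, stand-in for the generator's output
    mask_ratio:       fraction of positions to corrupt (hypothetical knob)
    """
    n = tokens.shape[0]
    num_selected = int(round(mask_ratio * n))
    # Randomly select the positions to corrupt, without repetition.
    selected = rng.choice(n, size=num_selected, replace=False)

    # Numerically stable softmax over the vocabulary at selected positions.
    logits = generator_logits[selected]
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)

    # Sample one replacement token per selected position.
    sampled = np.array(
        [rng.choice(probs.shape[1], p=p) for p in probs], dtype=tokens.dtype
    )

    corrupted = tokens.copy()
    corrupted[selected] = sampled
    # Assumption: the label is 1 only where the token actually changed.
    replaced = (corrupted != tokens).astype(np.int64)
    return corrupted, replaced, selected


rng = np.random.default_rng(0)
tokens = rng.integers(0, 8192, size=196)   # e.g. a 14x14 grid of token ids
logits = rng.normal(size=(196, 8192))      # random stand-in for BEiT logits
corrupted, replaced, idx = corrupt_tokens(tokens, logits, 0.5, rng)
```

Under these assumptions, `replaced` would serve as the per-token target for the discriminative enhancer objective, while the generative objective would instead regress the original pixels from the corrupted input.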