Unsupervised learning of vision transformers seeks to pretrain an encoder on pretext tasks without labels. One such task is Masked Image Modeling (MIM), which mirrors the pretraining of language transformers by predicting masked patches. A key criterion in unsupervised pretraining is that the pretext task must be sufficiently hard to prevent the transformer encoder from learning trivial low-level features that generalize poorly to downstream tasks. To this end, we propose an Adversarial Positional Embedding (AdPE) approach: it distorts local visual structures by perturbing the position encodings, so that the learned transformer cannot simply use locally correlated patches to predict the missing ones. We hypothesize that this forces the transformer encoder to learn more discriminative features in a global context, with stronger generalizability to downstream tasks. We consider both absolute and relative positional encodings, where adversarial positions can be imposed in both the embedding mode and the coordinate mode. We also present a new MAE+ baseline that, together with AdPE, brings the performance of MIM pretraining to a new level. Experiments demonstrate that our approach improves the fine-tuning accuracy of MAE by $0.8\%$ and $0.4\%$ over 1600 epochs of pretraining ViT-B and ViT-L, respectively, on ImageNet-1K. In transfer learning, it outperforms MAE with the ViT-B backbone by $2.6\%$ in mIoU on ADE20K, and by $3.2\%$ in AP$^{bbox}$ and $1.6\%$ in AP$^{mask}$ on COCO. These results are obtained with AdPE as a pure MIM approach that uses no extra models or external datasets for pretraining. The code is available at https://github.com/maple-research-lab/AdPE.
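To make the two perturbation modes concrete, the following is a minimal sketch and not the released implementation. It assumes a 1-D row of patches, a sinusoidal absolute encoding, a toy attention-style predictor standing in for the MIM reconstruction loss, an $L_\infty$ bound `eps`, and a single sign-gradient ascent step with finite-difference gradients. None of these choices is specified in the abstract, and all function names (`sincos_embed`, `toy_mim_loss`, `adversarial_coords`, `clip_embedding_offset`) are hypothetical.

```python
import math

def sincos_embed(coord, dim=8, temperature=100.0):
    """Sinusoidal embedding of one (possibly fractional) 1-D patch coordinate."""
    half = dim // 2
    freqs = [temperature ** (-i / half) for i in range(half)]
    return [math.sin(coord * f) for f in freqs] + [math.cos(coord * f) for f in freqs]

def toy_mim_loss(coords, values, masked):
    """Stand-in for the MIM loss (an assumption): predict the masked patch as a
    softmax-weighted mean of visible patches, weighted by positional similarity."""
    query = sincos_embed(coords[masked])
    visible = [j for j in range(len(coords)) if j != masked]
    logits = [sum(a * b for a, b in zip(query, sincos_embed(coords[j]))) for j in visible]
    top = max(logits)  # subtract the max for a numerically stable softmax
    weights = [math.exp(l - top) for l in logits]
    pred = sum(w * values[j] for w, j in zip(weights, visible)) / sum(weights)
    return (pred - values[masked]) ** 2

def adversarial_coords(coords, values, masked, eps=0.5, h=1e-4):
    """Coordinate mode: move each coordinate by +/- eps along the sign of the
    loss gradient (central finite differences), so offsets stay within eps."""
    perturbed = []
    for i in range(len(coords)):
        up = coords[:i] + [coords[i] + h] + coords[i + 1:]
        down = coords[:i] + [coords[i] - h] + coords[i + 1:]
        grad = (toy_mim_loss(up, values, masked) - toy_mim_loss(down, values, masked)) / (2 * h)
        step = eps if grad > 0 else (-eps if grad < 0 else 0.0)
        perturbed.append(coords[i] + step)
    return perturbed

def clip_embedding_offset(embedding, offset, eps):
    """Embedding mode: add an offset to the embedding itself, clipped per entry to [-eps, eps]."""
    return [e + max(-eps, min(eps, o)) for e, o in zip(embedding, offset)]

coords = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]    # clean patch coordinates
values = [0.0, 0.1, 0.4, 0.9, 1.6, 2.5]    # toy patch intensities
adv = adversarial_coords(coords, values, masked=2)
print("clean loss:", toy_mim_loss(coords, values, 2))
print("adversarial loss:", toy_mim_loss(adv, values, 2))
```

In the coordinate mode the perturbed coordinates are re-encoded by `sincos_embed`, so the resulting embedding stays on the manifold of valid position encodings, whereas the embedding mode perturbs the vector directly. A single sign-gradient step is not guaranteed to increase the loss; a practical implementation would presumably use automatic differentiation through the actual encoder and decoder.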