Masked image modeling (MIM), which predicts randomly masked patches from unmasked ones, has emerged as a promising approach to self-supervised vision pretraining. However, the theoretical understanding of MIM remains limited, especially for transformers, the foundational architecture in this setting. To the best of our knowledge, this paper provides the first end-to-end theory of learning one-layer transformers with softmax attention in MIM self-supervised pretraining. On the conceptual side, we posit a theoretical mechanism for how transformers pretrained with MIM produce the empirically observed local and diverse attention patterns on data distributions whose spatial structure highlights feature-position correlations. On the technical side, our end-to-end analysis of the training dynamics of softmax-based transformers accommodates input and position embeddings simultaneously. It builds on a novel approach that tracks the interplay between the attention of feature-position correlations and that of position-wise correlations.
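To make the setting concrete, the following is a minimal sketch of an MIM reconstruction loss computed with a single softmax attention layer over input and position embeddings. It is illustrative only and is not the exact model or objective analyzed in the paper: the function name `mim_loss`, the single attention matrix `Q`, the zeroing of masked patch content, and the squared-error loss are all assumptions made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; entries equal to -inf receive weight 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mim_loss(X, P, Q, mask):
    """Hypothetical one-layer MIM objective (an assumption, not the paper's).

    X:    (n, d) patch features
    P:    (n, d) position embeddings
    Q:    (d, d) attention weight matrix
    mask: (n,) boolean, True where a patch is masked
          (needs at least one masked and one unmasked patch)
    """
    # Masked patches lose their content but keep their position embedding.
    X_in = np.where(mask[:, None], 0.0, X)
    E = X_in + P
    # Attention scores between every pair of patches.
    scores = E @ Q @ E.T
    # Restrict attention to unmasked patches (the keys).
    scores = np.where(mask[None, :], -np.inf, scores)
    A = softmax(scores, axis=-1)
    # Each patch is predicted as an attention-weighted average of visible patches.
    X_hat = A @ X_in
    # Squared reconstruction error, averaged over masked patches only.
    err = X_hat[mask] - X[mask]
    return 0.5 * np.mean(np.sum(err ** 2, axis=-1)), A


# Small demo with random data and a fixed number of masked patches.
rng = np.random.default_rng(0)
n, d, n_masked = 8, 4, 3
X = rng.normal(size=(n, d))
P = rng.normal(size=(n, d))
Q = 0.1 * rng.normal(size=(d, d))
mask = np.zeros(n, dtype=bool)
mask[rng.permutation(n)[:n_masked]] = True

loss, A = mim_loss(X, P, Q, mask)
print("reconstruction loss:", loss)
print("attention row of first masked patch:", A[mask][0])
```

In this sketch, the score for a masked query patch depends only on its position embedding, while the score contribution of an unmasked key depends on both its features and its position. This is one simple way to see how feature-position and position-wise correlations can both enter the attention pattern.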