This paper presents a scalable approach for simultaneously learning spatially distributed visual representations over individual tokens and a holistic instance representation. We use self-attention blocks to represent the spatially distributed tokens, followed by cross-attention blocks that aggregate them into a holistic representation of the image instance. The core of the approach is extremely heavy token masking (75%-90%) as the data augmentation for supervision. Our model, named ExtreMA, follows the plain BYOL approach: the instance representation computed from the unmasked subset of tokens is trained to predict the one computed from the intact input. Rather than encouraging invariance across inputs, this objective requires the model to capture informative variations within an image. The paper makes three contributions: 1) It presents random masking as a strong and computationally efficient data augmentation for siamese representation learning. 2) With multiple masks sampled per instance, extreme masking greatly speeds up learning and improves performance with more data. 3) ExtreMA obtains stronger linear probing performance than masked modeling methods and better transfer performance than prior contrastive models.
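The training signal described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 196-token input (a 14 x 14 patch grid), the 64-dimensional tokens, and the mean-pooling encoder are all assumptions made so the example runs with NumPy alone. The actual model uses self-attention and cross-attention blocks, and BYOL additionally uses a predictor head and a momentum-updated target network, both omitted here.

```python
import numpy as np


def sample_visible_indices(num_tokens, mask_ratio, rng):
    """Pick the token indices that stay visible after random masking."""
    num_keep = max(1, int(round(num_tokens * (1.0 - mask_ratio))))
    return np.sort(rng.permutation(num_tokens)[:num_keep])


def toy_instance_encoder(tokens):
    """Hypothetical stand-in for the encoder: mean-pool tokens into one vector.

    The paper describes self-attention blocks followed by cross-attention
    blocks; mean pooling is used here only to keep the sketch dependency-free.
    """
    return tokens.mean(axis=0)


def byol_regression_loss(prediction, target):
    """BYOL-style loss: 2 - 2 * cosine similarity (target treated as constant)."""
    p = prediction / np.linalg.norm(prediction)
    z = target / np.linalg.norm(target)
    return float(2.0 - 2.0 * np.dot(p, z))


def extreme_masking_loss(tokens, mask_ratio, num_samples, rng):
    """Average the loss over several independently masked views of one image."""
    # Instance representation of the intact (unmasked) input.
    target = toy_instance_encoder(tokens)
    losses = []
    for _ in range(num_samples):
        keep = sample_visible_indices(tokens.shape[0], mask_ratio, rng)
        # The online branch only ever sees the visible tokens.
        online = toy_instance_encoder(tokens[keep])
        losses.append(byol_regression_loss(online, target))
    return float(np.mean(losses))


rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 64))  # assumed: 14 x 14 patch tokens, 64-dim each
loss = extreme_masking_loss(tokens, mask_ratio=0.8, num_samples=4, rng=rng)
print(f"mean loss over 4 masked views: {loss:.4f}")
```

Because the online branch processes only the visible tokens (about 39 of 196 at an 80% masking ratio in this sketch), each masked view is cheap to encode, which is what makes sampling several masks per image affordable.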