Self-supervised visual representation learning has traditionally focused on image-level instance discrimination. We introduce a fine-grained dimension by integrating patch-level discrimination into these methods, enabling the simultaneous analysis of local and global visual features and improving the quality of the learned representations. The original images first undergo spatial augmentation. We then apply a distinctive photometric patch-level augmentation, in which each patch is augmented independently of the other patches within the same view. This produces diverse training views with distinct color variations in each segment. The augmented images are processed by a self-distillation learning framework with a Vision Transformer (ViT) backbone. The proposed method minimizes representation distances at both the image and patch levels, capturing detail from macro to micro perspectives. To this end, we present a simple yet effective patch-matching algorithm that finds corresponding patches across the augmented views. Owing to the efficiency of this algorithm, our method has lower computational complexity than comparable approaches, so the finer-grained supervision comes without a significant increase in computational cost. We pretrain our method extensively on datasets of varied scale, including CIFAR-10, ImageNet-100, and ImageNet-1K. It outperforms state-of-the-art self-supervised representation learning methods on image classification and on downstream tasks such as copy detection and image retrieval. The implementation of our method is available on GitHub.
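The per-patch photometric augmentation described above can be sketched as follows. This is a minimal illustration, not the paper's exact recipe: the jitter family (brightness shift plus contrast scaling), its ranges, and the patch size are assumptions.

```python
import numpy as np

def patchwise_photometric_augment(img, patch_size=4, rng=None):
    """Apply an independent photometric jitter to each non-overlapping
    patch of an HxWxC image with values in [0, 1].

    Hypothetical sketch: the brightness/contrast jitter and its ranges
    are assumptions standing in for the paper's augmentation family.
    """
    rng = np.random.default_rng(rng)
    h, w, _ = img.shape
    out = img.copy()
    for y in range(0, h, patch_size):
        for x in range(0, w, patch_size):
            patch = out[y:y + patch_size, x:x + patch_size]
            # Draw jitter parameters independently for every patch, so
            # color statistics differ between segments of the same view.
            brightness = rng.uniform(-0.2, 0.2)
            contrast = rng.uniform(0.8, 1.2)
            mean = patch.mean(axis=(0, 1), keepdims=True)
            out[y:y + patch_size, x:x + patch_size] = np.clip(
                (patch - mean) * contrast + mean + brightness, 0.0, 1.0)
    return out
```

Because each patch draws its own parameters, the network cannot rely on globally consistent color statistics and is pushed toward structure-based patch representations.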
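As a rough illustration of how patch matching across augmented views can work when the spatial augmentation parameters are known, the sketch below matches the patch-grid cells of two crops by nearest patch centers in original-image coordinates. The grid size, the crop parameterization `(top, left, height, width)`, and the half-patch acceptance threshold are all assumptions, not the paper's algorithm.

```python
import numpy as np

def patch_centers(box, grid):
    """Centers of a grid x grid patch layout of a crop, expressed in
    original-image coordinates. box = (top, left, height, width)."""
    top, left, h, w = box
    ys = top + (np.arange(grid) + 0.5) * h / grid
    xs = left + (np.arange(grid) + 0.5) * w / grid
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([yy.ravel(), xx.ravel()], axis=1)  # (grid*grid, 2)

def match_patches(box_a, box_b, grid=14):
    """Pair each patch of view A with its nearest patch of view B,
    keeping only pairs whose centers lie within half a patch of B.

    Hypothetical sketch of a correspondence step between two crops of
    the same image; threshold choice is an assumption."""
    ca = patch_centers(box_a, grid)
    cb = patch_centers(box_b, grid)
    d = np.linalg.norm(ca[:, None, :] - cb[None, :, :], axis=-1)
    nearest = d.argmin(axis=1)
    thresh = 0.5 * min(box_b[2], box_b[3]) / grid
    return [(i, int(nearest[i])) for i in range(len(ca))
            if d[i, nearest[i]] <= thresh]
```

Since the crop boxes fully determine the geometry, matching reduces to index arithmetic over the two grids, which is far cheaper than feature-based correspondence search.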