The Joint-Embedding Predictive Architecture (JEPA) has recently emerged as a significant approach to unsupervised visual representation learning, extracting visual features from unlabeled imagery through an innovative masking strategy. Despite its success, two primary limitations remain: the Exponential Moving Average (EMA) used in I-JEPA cannot reliably prevent complete collapse, and I-JEPA's prediction objective does not accurately learn the mean of the patch representations. To address these challenges, this study introduces a novel framework, C-JEPA (Contrastive-JEPA), which integrates the Image-based Joint-Embedding Predictive Architecture with Variance-Invariance-Covariance Regularization (VICReg). This integration learns the variance/covariance structure needed to prevent complete collapse and enforces invariance of the mean across augmented views, thereby overcoming both limitations. Through empirical and theoretical evaluations, we demonstrate that C-JEPA significantly enhances the stability and quality of visual representation learning. When pre-trained on the ImageNet-1K dataset, C-JEPA converges both faster and to better performance under linear probing and fine-tuning.
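To make the regularization concrete, the sketch below computes the three VICReg terms (invariance, variance, covariance) for two batches of embeddings in pure Python. This is a minimal illustration of the general VICReg objective, not the authors' C-JEPA implementation; the function name, hyperparameters, and list-based representation are assumptions for readability. The variance hinge is what penalizes a collapsed batch (all embeddings identical), which is the failure mode the abstract says EMA alone cannot prevent.

```python
import math

def vicreg_terms(z_a, z_b, gamma=1.0, eps=1e-4):
    """Illustrative VICReg terms for two views, each a list of
    equal-length embedding vectors (batch size n, dimension d)."""
    n, d = len(z_a), len(z_a[0])

    # Invariance term: mean squared error between the two views.
    inv = sum((a - b) ** 2
              for va, vb in zip(z_a, z_b)
              for a, b in zip(va, vb)) / (n * d)

    def var_cov(z):
        # Center each embedding dimension over the batch.
        means = [sum(v[j] for v in z) / n for j in range(d)]
        c = [[v[j] - means[j] for j in range(d)] for v in z]
        # Variance term: hinge pushing each dimension's std above gamma;
        # a collapsed batch (std = 0) takes the maximum penalty.
        var = sum(max(0.0, gamma - math.sqrt(
                      sum(v[j] ** 2 for v in c) / (n - 1) + eps))
                  for j in range(d)) / d
        # Covariance term: squared off-diagonal entries of the
        # covariance matrix, decorrelating the dimensions.
        cov = 0.0
        for j in range(d):
            for k in range(d):
                if j != k:
                    cjk = sum(v[j] * v[k] for v in c) / (n - 1)
                    cov += cjk ** 2
        return var, cov / d

    var_a, cov_a = var_cov(z_a)
    var_b, cov_b = var_cov(z_b)
    return inv, var_a + var_b, cov_a + cov_b
```

In the full objective these three terms are combined with scalar weights; C-JEPA's contribution, per the abstract, is pairing this style of regularization with I-JEPA's masked prediction so that collapse is prevented without relying on EMA alone.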