This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) sample target blocks with sufficiently large scale (semantic), and to (b) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/14 on ImageNet using 16 A100 GPUs in under 72 hours to achieve strong downstream performance across a wide range of tasks, from linear classification to object counting and depth prediction.
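To make the masking strategy concrete, the following is a minimal sketch of how one context block and several target blocks might be sampled on a grid of image patches. It is an illustration under stated assumptions, not the authors' implementation: the function names `sample_block` and `sample_masks`, the 16 x 16 patch grid, the number of target blocks, and the scale and aspect-ratio ranges are all hypothetical choices that the abstract does not specify. Removing target patches from the context, so that the prediction task cannot be solved by copying, is likewise an assumption of this sketch.

```python
import math
import random


def sample_block(grid, scale_range, aspect_range, rng):
    """Sample a rectangular block of (row, col) patch indices on a grid x grid layout."""
    scale = rng.uniform(*scale_range)    # fraction of all patches the block should cover
    aspect = rng.uniform(*aspect_range)  # height / width ratio of the block
    area = scale * grid * grid
    # Convert the requested area and aspect ratio to integer height and width,
    # clamped so the block always fits inside the grid.
    h = max(1, min(grid, round(math.sqrt(area * aspect))))
    w = max(1, min(grid, round(math.sqrt(area / aspect))))
    top = rng.randint(0, grid - h)       # randint is inclusive on both ends
    left = rng.randint(0, grid - w)
    return {(r, c) for r in range(top, top + h) for c in range(left, left + w)}


def sample_masks(grid=16, num_targets=4, seed=0):
    """Return (context, targets): one large context block and several target blocks."""
    rng = random.Random(seed)
    # (a) Target blocks with a relatively large scale (assumed range: 15-20% of patches).
    targets = [
        sample_block(grid, (0.15, 0.20), (0.75, 1.5), rng)
        for _ in range(num_targets)
    ]
    # (b) A large, spatially distributed context block (assumed range: 85-100% of patches).
    context = sample_block(grid, (0.85, 1.0), (1.0, 1.0), rng)
    # Assumption: drop any patch that belongs to a target from the context.
    for target in targets:
        context -= target
    return context, targets


if __name__ == "__main__":
    context, targets = sample_masks()
    print("context patches:", len(context))
    print("target block sizes:", [len(t) for t in targets])
```

In this sketch the context encoder would see only the patches in `context`, and the predictor would be asked to output representations for the patches in each element of `targets`. Shrinking the target scale range or the context scale range is the natural way to probe the two conditions (a) and (b) stated above.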