The reliance on large-scale datasets and extensive computational resources has become a major barrier to advancing representation learning in vision, especially in data-scarce domains. In this paper, we address a critical question: can we escape the big-data paradigm in self-supervised representation learning from images? We introduce SCOTT (Sparse Convolutional Tokenizer for Transformers), a shallow tokenization architecture compatible with Masked Image Modeling (MIM) tasks. SCOTT injects convolutional inductive biases into Vision Transformers (ViTs), improving their efficacy in small-scale data regimes. Alongside it, we propose training a Joint-Embedding Predictive Architecture within a MIM framework (MIM-JEPA), which operates in latent representation space to capture more semantic features. Our approach enables ViTs to be trained from scratch on datasets orders of magnitude smaller than traditionally required, without relying on massive external datasets for pretraining. We validate our method on three small-scale, standard-resolution, fine-grained datasets: Oxford Flowers-102, Oxford-IIIT Pets-37, and ImageNet-100. Despite the challenges of limited data and high intra-class similarity, frozen SCOTT models pretrained with MIM-JEPA significantly outperform fully supervised methods and achieve results competitive with state-of-the-art approaches that rely on large-scale pretraining, complex image augmentations, and larger models. By demonstrating that robust off-the-shelf representations can be learned with limited data, compute, and model sizes, our work paves the way for computer vision applications in resource-constrained environments such as medical imaging and robotics. Our findings challenge the prevailing notion that vast amounts of data are indispensable for effective representation learning in vision, offering a pathway toward more accessible and inclusive advances in the field.