Understanding whether self-supervised learning methods can scale with unlimited data is crucial for training large-scale models. In this work, we conduct an empirical study of the scaling capability of masked image modeling (MIM) methods (e.g., MAE) for visual recognition. Unlike most previous works, which depend on the widely used, manually curated, and object-centric ImageNet dataset, we investigate this problem in a more practical setting. Specifically, we use the web-collected Coyo-700M dataset: we randomly sample varying numbers of training images from it to construct a series of sub-datasets containing 0.5M, 1M, 5M, 10M, and 100M images for pre-training. Our goal is to investigate how downstream performance changes as data size and model size are scaled. The study reveals that: 1) MIM is an effective way to improve model capacity when the training data is relatively small; 2) strong reconstruction targets endow models with increased capacity on downstream tasks; 3) MIM pre-training is data-agnostic in most scenarios, meaning that the strategy used to sample pre-training data is not critical. We hope these observations provide valuable insights for future research on MIM.
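The sub-dataset construction described above can be illustrated with a minimal sketch. It assumes that images are addressed by integer index and that each sub-dataset is an independent uniform draw without replacement; the abstract does not say whether the sub-datasets are nested or independent, and the names `SUBSET_SIZES` and `sample_subsets` are hypothetical, not taken from the authors' code.

```python
import random

# Sub-dataset sizes named in the abstract (number of images).
SUBSET_SIZES = [500_000, 1_000_000, 5_000_000, 10_000_000, 100_000_000]


def sample_subsets(num_images, sizes, seed=0):
    """Return {size: list of image indices}, one uniform random draw per size.

    Assumption: each subset is drawn independently and without replacement
    from the full pool of `num_images` indices. A nested scheme (each smaller
    subset contained in the next larger one) would be an equally plausible
    reading of the abstract.
    """
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    subsets = {}
    for size in sizes:
        if size > num_images:
            raise ValueError(f"cannot sample {size} images from a pool of {num_images}")
        # range() is lazy, so the full pool is never materialized in memory.
        subsets[size] = rng.sample(range(num_images), size)
    return subsets


# Scaled-down demo: the same size ratios, shrunk by a factor of 100,000
# so that it runs instantly (a 7,000-image pool stands in for Coyo-700M).
demo_sizes = [s // 100_000 for s in SUBSET_SIZES]
demo = sample_subsets(num_images=7_000, sizes=demo_sizes, seed=0)
for size, indices in demo.items():
    print(size, len(set(indices)))
```

At full scale the returned lists would be large (the 100M subset alone holds 100M integers), so in practice one would likely write the sampled indices to disk rather than keep every subset in memory at once.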