Training more generalizable, ground-breaking vision models requires billion-scale image collections, and with them massive storage to hold and ship the images (e.g., the LAION-4B dataset needs 240TB of storage space). Handling ever-growing datasets on limited storage infrastructure has therefore become challenging. A number of storage-efficient training methods have been proposed to tackle the problem, but they rarely scale or suffer severe performance degradation. In this paper, we propose a storage-efficient training strategy for vision classifiers on large-scale datasets (e.g., ImageNet) that uses only 1024 tokens per instance and no raw pixels; our token storage needs <1% of the space of the original JPEG-compressed raw pixels. We also propose token augmentations and a Stem-adaptor module so that our approach can use the same architecture as pixel-based approaches, with only minimal modifications to the stem layer and carefully tuned optimization settings. Our experimental results on ImageNet-1k show that our method outperforms other storage-efficient training methods by a large margin. We further show the effectiveness of our method in other practical scenarios: storage-efficient pre-training and continual learning. Code is available at https://github.com/naver-ai/seit
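To make the storage comparison concrete, the following is a minimal back-of-the-envelope sketch of how per-image token storage could be estimated. It is not taken from the released code: the helper names, the fixed-width bit-packing scheme, and the example codebook size of 8192 are assumptions chosen for illustration, and the average JPEG size is left as a caller-supplied value because the abstract does not state one.

```python
def bits_per_token(codebook_size: int) -> int:
    """Fixed-width bits needed to index a codebook of the given size."""
    if codebook_size < 2:
        raise ValueError("codebook_size must be at least 2")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 2, without float error.
    return (codebook_size - 1).bit_length()


def token_bytes_per_image(num_tokens: int, codebook_size: int) -> int:
    """Bytes needed to bit-pack one image's token indices, rounded up."""
    total_bits = num_tokens * bits_per_token(codebook_size)
    return (total_bits + 7) // 8


def storage_ratio(num_tokens: int, codebook_size: int, avg_jpeg_bytes: float) -> float:
    """Token storage as a fraction of an assumed average JPEG size."""
    if avg_jpeg_bytes <= 0:
        raise ValueError("avg_jpeg_bytes must be positive")
    return token_bytes_per_image(num_tokens, codebook_size) / avg_jpeg_bytes


if __name__ == "__main__":
    # 1024 tokens per instance comes from the abstract; the codebook size
    # is a hypothetical value, not stated in the abstract.
    print(token_bytes_per_image(num_tokens=1024, codebook_size=8192))
```

Fixed-width packing is only an upper bound on the token footprint. Any entropy coding applied on top would shrink it further, so this sketch should not be read as reproducing the <1% figure.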