Many researchers believe that ConvNets perform well on small or moderately sized datasets, but are not competitive with Vision Transformers when given access to web-scale datasets. We challenge this belief by evaluating a performant ConvNet architecture pre-trained on JFT-4B, a large labelled dataset of images often used for training foundation models. We consider pre-training compute budgets between 0.4k and 110k TPU-v4 core compute hours, and train a series of networks of increasing depth and width from the NFNet model family. We observe a log-log scaling law between held-out loss and compute budget. After fine-tuning on ImageNet, NFNets match the reported performance of Vision Transformers with comparable compute budgets. Our strongest fine-tuned model achieves a Top-1 accuracy of 90.4%.
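A log-log scaling law between held-out loss and compute budget means that the loss is approximately a power law in compute, so the points fall on a straight line when both axes are logarithmic. As a minimal sketch of how such a law could be fitted, assuming NumPy is available, the snippet below runs a least-squares line fit in log space. The helper name `fit_power_law`, the compute grid, and the loss values are all hypothetical and synthetic; they are not measurements from this work.

```python
import numpy as np


def fit_power_law(compute, loss):
    """Fit loss ~= a * compute**b by linear least squares in log-log space.

    Returns (a, b). A negative b means loss falls as compute grows.
    """
    log_c = np.log(np.asarray(compute, dtype=float))
    log_l = np.log(np.asarray(loss, dtype=float))
    # polyfit with deg=1 returns [slope, intercept] of the fitted line.
    slope, intercept = np.polyfit(log_c, log_l, deg=1)
    return float(np.exp(intercept)), float(slope)


# Synthetic illustration only: budgets spanning the 0.4k-110k core-hour
# range mentioned above, with losses generated from a made-up power law.
compute_hours = np.array([0.4e3, 1.6e3, 6.4e3, 25.6e3, 110e3])
synthetic_loss = 5.0 * compute_hours ** -0.1

a, b = fit_power_law(compute_hours, synthetic_loss)
print(f"fitted prefactor a = {a:.3f}, exponent b = {b:.3f}")
```

Because the synthetic losses are generated from an exact power law, the fit recovers the generating prefactor and exponent; real held-out losses would scatter around the fitted line.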