Combined Scaling for Zero-shot Transfer Learning

Hieu Pham, Zihang Dai, Golnaz Ghiasi, Kenji Kawaguchi, Hanxiao Liu, Adams Wei Yu, Jiahui Yu, Yi-Ting Chen, Minh-Thang Luong, Yonghui Wu, Mingxing Tan, Quoc V. Le

We present a combined scaling method, named BASIC, that achieves 85.7% top-1 accuracy on the ImageNet ILSVRC-2012 validation set without learning from any labeled ImageNet example. This accuracy surpasses the best published similar models, CLIP and ALIGN, by 9.3%. Our BASIC model also shows significant improvements on robustness benchmarks. For instance, on 5 test sets with natural distribution shifts (ImageNet-{A,R,V2,Sketch} and ObjectNet), our model achieves 84.3% top-1 average accuracy, only a small drop from its original ImageNet accuracy. To achieve these results, we scale up the contrastive learning framework of CLIP and ALIGN in three dimensions: data size, model size, and batch size. Our dataset has 6.6B noisy image-text pairs, which is 4x larger than ALIGN's and 16x larger than CLIP's. Our largest model has 3B weights, which is 3.75x larger in parameters and 8x larger in FLOPs than ALIGN and CLIP. Finally, our batch size is 65536, which is 2x larger than CLIP's and 4x larger than ALIGN's. We encountered two main challenges with the scaling rules of BASIC. First, implementing the combined scaling rules is constrained by the limited memory of accelerators such as GPUs and TPUs. To overcome the memory limit, we propose two simple methods that make use of gradient checkpointing and model parallelism. Second, while increasing the dataset size and the model size has been the de facto method to improve the performance of deep learning models like BASIC, the effect of a large contrastive batch size on such contrastively trained image-text models is not well understood. To shed light on the benefits of large contrastive batch sizes, we develop a theoretical framework which shows that larger contrastive batch sizes lead to smaller generalization gaps for image-text models such as BASIC.
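
To make the role of the contrastive batch size concrete, the following is a minimal sketch of a symmetric image-text contrastive loss of the kind used by CLIP and ALIGN. It is an illustration under stated assumptions, not the BASIC implementation: the function names, the toy embeddings, and the temperature value of 0.07 are all hypothetical choices made for this example. With a batch of n pairs, each image is scored against its one matching caption and n - 1 non-matching captions, so the batch size directly sets the number of negatives per example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax_at(logits, target):
    """Log-probability of index `target` under a softmax over `logits`."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[target] - log_z


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_embs[i] and text_embs[i] are assumed to be a matching pair;
    every other pairing in the batch is treated as a negative.
    The temperature default is an assumption made for this sketch.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)

    # sim[i][j]: scaled cosine similarity between image i and text j.
    sim = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    # Image-to-text: each row should put its mass on the diagonal entry.
    i2t = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Text-to-image: the same objective applied to each column.
    t2i = -sum(
        log_softmax_at([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (i2t + t2i)


if __name__ == "__main__":
    aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
    shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
    print(f"aligned pairs: {aligned:.6f}, shuffled pairs: {shuffled:.6f}")
```

The similarity matrix has n x n entries, so the cost of forming it, together with the activations needed to backpropagate through both encoders for the whole batch, grows with the batch size. This is one way to see why memory becomes the limiting factor at large contrastive batch sizes.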
