The recent work CLIPA presents an inverse scaling law for CLIP training: the larger the image/text encoders used, the shorter the sequence length of image/text tokens that can be applied in training. This finding makes it possible to train high-performance CLIP models with significantly reduced computation. Building on this work, we present CLIPA-v2 with two key contributions. Technically, we find that this inverse scaling law also applies in the finetuning stage, enabling a further reduction in computational needs. Empirically, we explore CLIPA at scale, extending the experiments up to the H/14 model with ~13B image-text pairs seen during training. With a budget of only $10,000, our CLIP model achieves a zero-shot ImageNet accuracy of 81.1%, surpassing the prior best CLIP model (from OpenCLIP, 80.1%) by 1.0 percentage points while reducing the computational cost by ~39X. Moreover, with an additional investment of $4,000, we can further raise the zero-shot ImageNet accuracy to 81.8%. Our code and models are available at https://github.com/UCSC-VLAA/CLIPA.
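To make "shorter sequence length of image tokens" concrete, the following minimal sketch counts the patch tokens a ViT-style image encoder with patch size 14 (as in H/14) would process at a few input resolutions. The function name `num_image_tokens` and the resolutions are illustrative assumptions, not settings taken from the abstract; the sketch only shows how quickly the token count falls as the input shrinks.

```python
def num_image_tokens(resolution: int, patch_size: int = 14) -> int:
    """Patch tokens for a square image in a ViT-style encoder (class token excluded)."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    side = resolution // patch_size  # patches along one image side
    return side * side


# Hypothetical resolutions, chosen only for illustration.
for res in (224, 112, 84):
    print(f"{res}x{res} px -> {num_image_tokens(res)} image tokens")
```

Since self-attention cost grows faster than linearly with sequence length, shortening the token sequence is one plausible route to the kind of compute savings the abstract describes.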