CLIP, the first foundation model that connects images and text, has enabled many recent breakthroughs in computer vision. However, its training cost is prohibitively high, which poses a significant barrier to its widespread exploration. In this paper, we present a surprising finding: there exists an inverse scaling law for CLIP training, whereby the larger the image/text encoders used, the shorter the sequence length of image/text tokens that can be applied in training. Moreover, we show that the strategy for reducing image/text token length plays a crucial role in determining the quality of this scaling law. As a result of this finding, we are able to train CLIP successfully even with academic resources. For example, on an eight-GPU A100 server, our CLIP models achieve zero-shot top-1 ImageNet accuracies of 63.2% in ~2 days, 67.8% in ~3 days, and 69.3% in ~4 days. By lowering the computational barrier associated with CLIP, we hope to inspire more research in this field, particularly from academics. Our code is available at https://github.com/UCSC-VLAA/CLIPA.
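To make "token sequence length" concrete, the following minimal sketch counts the patch tokens a ViT-style image encoder would see at different input resolutions and truncates a text token sequence. It is illustrative only: the function names, the patch size of 16, and the choice of image resizing and text truncation as reduction strategies are assumptions made here and are not taken from the abstract.

```python
def num_image_tokens(image_size: int, patch_size: int) -> int:
    """Patch tokens for a square image in a ViT-style encoder (class token excluded)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side * side


def truncate_text(token_ids: list[int], max_len: int) -> list[int]:
    """Keep only the first max_len text tokens (one simple, assumed strategy)."""
    return token_ids[:max_len]


if __name__ == "__main__":
    # Assumed patch size of 16; smaller inputs yield quadratically fewer tokens.
    for size in (224, 112, 64):
        print(size, num_image_tokens(size, patch_size=16))
```

Under these assumptions, halving the input resolution cuts the image token count by a factor of four, which is why shortening the sequence can reduce training compute substantially.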