CLIP, one of the pioneering foundation models that connect images and text, has enabled many recent breakthroughs in computer vision. However, its training cost is prohibitively high, which poses a significant barrier to its widespread exploration. In this paper, we present a surprising finding: there exists an inverse scaling law for CLIP training, whereby the larger the image/text encoders used, the shorter the sequence length of image/text tokens that can be applied in training. Moreover, we show that the strategy for reducing image/text token length plays a crucial role in determining the quality of this scaling law. As a result of this finding, we are able to train CLIP successfully even with limited computational resources. For example, using 8 A100 GPUs, our CLIP models achieve zero-shot top-1 ImageNet-1k accuracies of 63.2% in ~2 days, 67.8% in ~3 days, and 69.3% in ~4 days. Our method also works well when scaling up: with G/14, we set a new record of 83.0% ImageNet-1k zero-shot accuracy while accelerating training by ~33x compared to its OpenCLIP counterpart. By reducing the computation barrier associated with CLIP, we hope to inspire more research in this field, particularly from academia. Our code is available at https://github.com/UCSC-VLAA/CLIPA.
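To make "token sequence length" concrete, the following is a minimal sketch of generic ways to shorten the inputs of a ViT-style image encoder and a text encoder: lowering the input resolution, randomly dropping patch tokens, and truncating text. It is not the authors' implementation, and the abstract does not say which reduction strategies they use. The function names, the 112-pixel resolution, the 25% keep ratio, and the text lengths are illustrative assumptions; the patch size of 14 follows the G/14 naming.

```python
import random


def num_patch_tokens(image_size, patch_size):
    """Number of patch tokens a ViT-style encoder sees for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def random_keep_indices(num_tokens, keep_ratio, seed=0):
    """Pick a random subset of token positions to keep (random masking)."""
    keep = max(1, int(num_tokens * keep_ratio))
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_tokens), keep))


def truncate_text_tokens(token_ids, max_len):
    """Shorten a text token sequence by keeping only the first max_len ids."""
    return token_ids[:max_len]


# Assumed settings, chosen only for illustration.
full = num_patch_tokens(224, 14)         # 16 x 16 grid of patches
small = num_patch_tokens(112, 14)        # 8 x 8 grid after halving resolution
kept = random_keep_indices(full, 0.25)   # keep a quarter of the patch tokens
short_text = truncate_text_tokens(list(range(77)), 16)

print(full, small, len(kept), len(short_text))
```

Halving the resolution and keeping 25% of the tokens both cut the image sequence by 4x here, but they discard different information, which is why the choice of strategy can matter.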