Existing studies on training state-of-the-art Contrastive Language-Image Pretraining (CLIP) models with large-scale data involve hundreds or even thousands of GPUs due to the requirement of a large batch size. However, such a large amount of resources is not accessible to most people. While advanced compositional optimization techniques for optimizing global contrastive losses have proven effective at removing the large-batch-size requirement, their performance on large-scale data remains underexplored and unoptimized. To bridge this gap, this paper explores several aspects of CLIP training with limited resources (e.g., up to tens of GPUs). First, we introduce FastCLIP, a general CLIP training framework built on advanced compositional optimization techniques and designed and optimized for the distributed setting. Our framework is equipped with an efficient gradient reduction strategy to reduce communication overhead. Second, to further boost training efficiency, we investigate three components of the framework from an optimization perspective: the schedule of the inner learning rate, and the update rules of the temperature parameter and of the model parameters. Experiments with different strategies for each component shed light on how to conduct CLIP training more efficiently. Finally, we benchmark the performance of FastCLIP against the state-of-the-art training baseline (OpenCLIP) on compute scales of up to 32 GPUs across 8 nodes, and on three data scales of 2.7 million, 9.1 million, and 315 million image-text pairs, demonstrating the significant improvement of FastCLIP in the resource-limited setting. We release the code of FastCLIP at https://github.com/Optimization-AI/fast_clip .
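For context on the contrastive loss and temperature parameter mentioned above, the following is a minimal, illustrative sketch of the standard symmetric mini-batch CLIP loss (InfoNCE over image-text pairs). It is not FastCLIP's global contrastive objective or its compositional optimizer; the function name and NumPy formulation are our own for illustration.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, tau=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (B, d) arrays of L2-normalized embeddings,
    where row i of each array forms a matched image-text pair.
    tau: temperature parameter scaling the cosine similarities.
    """
    logits = img_emb @ txt_emb.T / tau  # (B, B) pairwise similarities

    def log_softmax(x, axis):
        # numerically stable log-softmax along the given axis
        x = x - x.max(axis=axis, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

    # cross-entropy with the matched pair as the positive:
    # rows give image-to-text retrieval, columns text-to-image
    i2t = -np.diag(log_softmax(logits, axis=1)).mean()
    t2i = -np.diag(log_softmax(logits, axis=0)).mean()
    return 0.5 * (i2t + t2i)
```

Because each sample is contrasted only against the other samples in the same batch, the quality of this mini-batch estimate degrades as the batch shrinks; this is the coupling between batch size and performance that global contrastive losses and compositional optimization are designed to break.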