Existing studies on training state-of-the-art Contrastive Language-Image Pretraining (CLIP) models on large-scale data involve hundreds or even thousands of GPUs due to the requirement of a large batch size. However, such resources are inaccessible to most people. While advanced compositional optimization techniques for optimizing global contrastive losses have been shown to be effective at removing the large-batch-size requirement, their performance on large-scale data remains underexplored and unoptimized. To bridge this gap, this paper explores several aspects of CLIP training with limited resources (e.g., up to tens of GPUs). First, we introduce FastCLIP, a general CLIP training framework built on advanced compositional optimization techniques and designed and optimized for the distributed setting. The framework is equipped with an efficient gradient reduction strategy that reduces communication overhead. Second, to further boost training efficiency, we investigate three components of the framework from an optimization perspective: the schedule of the inner learning rate, the update rule of the temperature parameter, and the update rule of the model parameters. Experiments with different strategies for each component shed light on how to conduct CLIP training more efficiently. Finally, we benchmark FastCLIP against the state-of-the-art training baseline (OpenCLIP) at compute scales of up to 32 GPUs on 8 nodes and at three data scales (2.7 million, 9.1 million, and 315 million image-text pairs), demonstrating the significant improvement of FastCLIP in the resource-limited setting. We release the code of FastCLIP at https://github.com/Optimization-AI/fast_clip .