Accurately estimating the normalization term (also known as the partition function) in the contrastive loss is a central challenge for training Contrastive Language-Image Pre-training (CLIP) models. Conventional methods rely on large batches for approximation, demanding substantial computational resources. To mitigate this issue, prior works introduced per-sample normalizer estimators, which are updated at each epoch in a blockwise coordinate manner to keep track of the evolving encoders. However, this scheme incurs an optimization error that scales with the ratio of dataset size to batch size, limiting its effectiveness for large datasets or small batches. To overcome this limitation, we propose NeuCLIP, a novel and elegant optimization framework based on two key ideas: (i) $\textbf{reformulating}$ the contrastive loss for each sample $\textbf{via convex analysis}$ into a minimization problem with an auxiliary variable representing its log-normalizer; and (ii) $\textbf{transforming}$ the resulting minimization over $n$ auxiliary variables (where $n$ is the dataset size) via $\textbf{variational analysis}$ into a minimization over a compact neural network that predicts the log-normalizers. We design an alternating optimization algorithm that jointly trains the CLIP model and the auxiliary network. By employing a tailored architecture and acceleration techniques for the auxiliary network, NeuCLIP achieves more accurate normalizer estimation, leading to improved performance compared with previous methods. Extensive experiments on large-scale CLIP training, spanning datasets from millions to billions of samples, demonstrate that NeuCLIP outperforms prior approaches. Code is available at https://github.com/Optimization-AI/NeuCLIP.
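To make idea (i) concrete: the reformulation rests on a standard variational identity from convex analysis, $\log z = \min_u \{u + z e^{-u} - 1\}$, minimized at $u = \log z$. Substituting this for the log-partition term turns each sample's contrastive loss into a joint minimization over the model and a per-sample auxiliary variable $u_i$; idea (ii) then amortizes the $n$ variables $u_i$ with a small network. The following is a minimal numerical check of the identity (our illustration, not code from the NeuCLIP repository):

```python
import numpy as np

# Variational identity behind idea (i): for z > 0,
#   log z = min_u ( u + z * exp(-u) - 1 ),  attained at u = log z.
# The surrogate is convex in u (second derivative z * exp(-u) > 0),
# so the per-sample loss  -s_i + log Z_i  can be replaced by
#   min_{u_i} ( -s_i + u_i + Z_i * exp(-u_i) - 1 ),
# and idea (ii) predicts u_i with a compact network instead of
# storing one auxiliary variable per sample.
def surrogate(u, z):
    return u + z * np.exp(-u) - 1.0

z = 7.3                                  # a stand-in partition value
us = np.linspace(-5.0, 5.0, 100001)      # dense grid over u
vals = surrogate(us, z)
u_star = us[np.argmin(vals)]

# The minimizer recovers log z, and the minimum value equals log z.
assert abs(u_star - np.log(z)) < 1e-3
assert abs(vals.min() - np.log(z)) < 1e-6
```

In practice the minimization over $u$ is not done by grid search; it is folded into the alternating optimization, where the auxiliary network's prediction of $\log Z_i$ plays the role of $u_i$.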