Cross-modal pre-training has shown impressive performance on a wide range of downstream tasks, benefiting from massive image-text pairs collected from the Internet. In practice, online data grow constantly, which makes it important for a pre-trained model to be able to learn from a continuously growing data stream. Existing work on cross-modal pre-training mainly focuses on training a network with a fixed architecture. However, fixing the model capacity is impractical given the continuously growing nature of pre-training data in real-world applications. At the same time, it is important to reuse the knowledge in the current model to train efficiently and reach better performance. To address these issues, we propose GrowCLIP, a data-driven automatic model growing algorithm for contrastive language-image pre-training that takes a continuous stream of image-text pairs as input. Specifically, we adopt a dynamic growth space and search for the optimal architecture at each growth step to adapt to online learning scenarios. We also introduce a shared encoder into the growth space to enhance the degree of cross-modal fusion. In addition, we explore the effect of growth along different dimensions, which could serve as a reference for the future design of cross-modal model architectures. Finally, we employ parameter inheriting with momentum (PIM) to preserve previously learned knowledge and to address the local minimum dilemma. Compared with existing methods, GrowCLIP improves average top-1 accuracy by 2.3% on zero-shot image classification across 9 downstream tasks. For zero-shot image retrieval, GrowCLIP improves top-1 image-to-text recall by 1.2% on the Flickr30K dataset.
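For readers unfamiliar with the contrastive language-image objective that the abstract builds on, the following is a minimal sketch of the symmetric contrastive loss commonly used in CLIP-style pre-training. It is not taken from GrowCLIP itself: the abstract does not specify the loss, and the function names, the pure-Python list representation, and the temperature value of 0.07 are all illustrative assumptions. Embeddings are assumed to be non-zero vectors, with row i of the image batch paired with row i of the text batch.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Hypothetical helper: the temperature is an illustrative assumption.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # Cosine similarity between every image and every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Transpose to obtain the text-to-image direction.
    logits_t = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image losses.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))
```

With correctly paired embeddings the loss is close to zero, and it grows when the pairing is shuffled, which is the signal that pulls matching image-text pairs together during pre-training.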