Keeping large foundation models up to date on the latest data is inherently expensive. To avoid the prohibitive costs of constantly retraining, it is imperative to continually train these models. This problem is exacerbated by the lack of any large-scale continual learning benchmarks or baselines. We introduce the first set of web-scale Time-Continual (TiC) benchmarks for training vision-language models: TiC-DataComp, TiC-YFCC, and TiC-Redcaps. TiC-DataComp, our largest dataset, contains over 12.7B timestamped image-text pairs spanning 9 years (2014-2022). We first use our benchmarks to curate various dynamic evaluations that measure the temporal robustness of existing models. We show that OpenAI's CLIP (trained on data up to 2020) loses $\approx 8\%$ zero-shot accuracy on our curated retrieval task from 2021-2022 compared with more recently trained models in the OpenCLIP repository. We then study how to efficiently train models on time-continuous data. We demonstrate that a simple rehearsal-based approach, which continues training from the last checkpoint and replays old data, reduces compute by $2.5\times$ compared to the standard practice of retraining from scratch. Code is available at https://github.com/apple/ml-tic-clip.
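To make the rehearsal idea concrete, the following is a minimal sketch of continuing from the last checkpoint while replaying old data. It is an illustration under stated assumptions, not the released implementation: the names `build_replay_mix` and `continual_train`, the fixed per-step `budget`, and the 50/50 `replay_fraction` are all hypothetical choices made here, and `train_step` stands in for whatever training routine is actually used.

```python
import random


def build_replay_mix(new_data, old_buckets, budget, replay_fraction=0.5, seed=0):
    """Sample up to `budget` examples mixing new data with replayed old data.

    `replay_fraction` is an assumed knob (not taken from the paper): the share
    of the budget drawn from previously seen time steps.
    """
    rng = random.Random(seed)
    # Pool every example from earlier time steps into one replay buffer.
    old_pool = [x for bucket in old_buckets for x in bucket]
    # On the first time step there is nothing to replay.
    n_old = min(int(budget * replay_fraction), len(old_pool))
    n_new = min(budget - n_old, len(new_data))
    mix = rng.sample(old_pool, n_old) + rng.sample(new_data, n_new)
    rng.shuffle(mix)
    return mix


def continual_train(timeline, budget, train_step, state=None):
    """Train over chronologically ordered datasets with a warm start.

    `train_step(state, examples)` is a hypothetical callback that returns the
    updated model state; `state` is carried across time steps, so training
    resumes from the last checkpoint instead of restarting from scratch.
    """
    seen = []
    for new_data in timeline:
        examples = build_replay_mix(new_data, seen, budget)
        state = train_step(state, examples)
        seen.append(new_data)
    return state
```

For example, `timeline` could be a list of per-year lists of image-text pairs, with `train_step` wrapping one fixed-compute round of contrastive training. The potential compute saving comes from each round using only a bounded budget while starting from the previous round's weights.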