Multimodal foundation models serve numerous applications at the intersection of vision and language. Yet, despite being pretrained on extensive data, they become outdated over time. To keep models updated, research into continual pretraining mainly explores scenarios with either (1) infrequent, indiscriminate updates on large-scale new data, or (2) frequent, sample-level updates. However, practical model deployment often operates in the gap between these two limit cases, as real-world applications demand adaptation to specific subdomains, tasks, or concepts, spread over the entire, varying life cycle of a model. In this work, we complement current perspectives on continual pretraining with a research test bed and provide comprehensive guidance for effective continual model updates in such scenarios. We first introduce FoMo-in-Flux, a continual multimodal pretraining benchmark with realistic compute constraints and practical deployment requirements, constructed over 63 datasets with diverse visual and semantic coverage. Using FoMo-in-Flux, we explore the complex landscape of practical continual pretraining from multiple perspectives: (1) a data-centric investigation of data mixtures and stream orderings that emulate real-world deployment situations, (2) a method-centric investigation ranging from simple fine-tuning and traditional continual learning strategies to parameter-efficient updates and model merging, (3) meta-learning-rate schedules and mechanistic design choices, and (4) the influence of model and compute scaling. Together, our insights provide a practitioner's guide to continual multimodal pretraining for real-world deployment. Our benchmark and code are available at: https://github.com/ExplainableML/fomo_in_flux.
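To make one of the method families named in point (2) concrete, below is a minimal sketch of model merging via weight-space interpolation between a pretrained and a continually fine-tuned checkpoint. This is a generic illustration of the technique, not code from the FoMo-in-Flux repository; the function and variable names (`merge_state_dicts`, `alpha`, `pretrained_sd`, `finetuned_sd`) are hypothetical.

```python
# Illustrative sketch (not from the FoMo-in-Flux codebase): merging two
# checkpoints of the same architecture by linear interpolation of weights.
import torch

def merge_state_dicts(pretrained_sd, finetuned_sd, alpha=0.5):
    """Interpolate between two PyTorch state dicts with matching keys.

    alpha = 0.0 returns the pretrained weights (maximum stability);
    alpha = 1.0 returns the fine-tuned weights (maximum plasticity).
    """
    merged = {}
    for key, w_pre in pretrained_sd.items():
        w_ft = finetuned_sd[key]
        if torch.is_floating_point(w_pre):
            merged[key] = (1.0 - alpha) * w_pre + alpha * w_ft
        else:
            # Non-float entries (e.g., integer buffers) cannot be
            # interpolated; keep the fine-tuned copy as-is.
            merged[key] = w_ft.clone()
    return merged

# Usage: snapshot the state dict before and after an update step, then
# model.load_state_dict(merge_state_dicts(sd_before, sd_after, alpha=0.5))
```

The single knob `alpha` trades off retaining pretrained knowledge against adapting to the new data, which is why merging is an attractive baseline for the update scenarios the benchmark studies.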