The ever-growing availability of unlabeled data presents both opportunities and challenges for training artificial intelligence systems. While self-supervised learning (SSL) has emerged as a powerful paradigm for extracting meaningful representations from vast amounts of unlabeled data, existing methods still struggle to adapt to the non-stationary, non-IID nature of real-world data streams without forgetting previously learned knowledge. Recent works have adopted a repeated cosine annealing schedule for large-scale continual pre-training; however, these schedules (1) inherently cause forgetting during the re-warming phase and (2) have not been systematically compared to existing continual SSL methods. In this work, we systematically compare the widely used cosine schedule with the recently proposed infinite learning rate schedule and empirically find the latter to be a more effective alternative. Our extensive empirical evaluation across diverse image and language datasets demonstrates that the infinite learning rate schedule consistently improves continual pre-training performance over repeated cosine decay, without being restricted to a fixed iteration budget. For instance, in a small-scale MAE pre-training setup, it outperforms several strong baselines from the literature. We then scale up our experiments to larger MAE pre-training and autoregressive language model pre-training. Our results show that the infinite learning rate schedule remains effective at scale, surpassing repeated cosine decay both in MAE pre-training and on zero-shot LM benchmarks.
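To make the contrast between the two schedule families concrete, the minimal sketch below implements both in plain Python. The warmup length, cycle length, learning-rate values, and the inverse-square-root cooldown shape are illustrative assumptions chosen for exposition, not the exact hyperparameters used in our experiments.

```python
import math

def repeated_cosine_lr(step, cycle_len=10_000, lr_max=1e-3, lr_min=1e-5):
    """Repeated cosine annealing: each new task/cycle re-warms the learning
    rate back to lr_max (the phase associated with forgetting), then decays
    it to lr_min over cycle_len steps. Values are illustrative."""
    t = step % cycle_len  # position within the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / cycle_len))

def infinite_lr(step, warmup=1_000, cooldown=9_000, lr_max=1e-3, lr_const=1e-4):
    """One common form of an 'infinite' schedule (an assumption for this
    sketch): a single linear warmup, an inverse-square-root cooldown onto a
    constant plateau, then a constant phase that can run for an unbounded
    number of steps, so new data can be ingested without ever re-warming.
    A short final annealing before evaluation is omitted for brevity."""
    if step < warmup:  # one-time linear warmup
        return lr_max * step / warmup
    if step < warmup + cooldown:  # inverse-sqrt decay from lr_max down to lr_const
        frac = (step - warmup) / cooldown
        return lr_max / math.sqrt(1 + frac * ((lr_max / lr_const) ** 2 - 1))
    return lr_const  # constant plateau: continual pre-training proceeds here

# Plotting both over several cycles shows the cosine schedule's periodic
# spikes back to lr_max at each task boundary, whereas the infinite
# schedule stays flat at lr_const after its one-time cooldown.
```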