We study the common continual learning setup where an overparameterized model is sequentially fitted to a set of jointly realizable tasks. We analyze forgetting, defined as the loss on previously seen tasks, after $k$ iterations. For continual linear models, we prove that fitting a task is equivalent to a single stochastic gradient descent (SGD) step on a modified objective. We develop novel last-iterate SGD upper bounds in the realizable least squares setup and leverage them to derive new results for continual learning. Focusing on random orderings over $T$ tasks, we establish universal forgetting rates, whereas existing rates depend on problem dimensionality or complexity and become prohibitive in highly overparameterized regimes. In continual regression with replacement, we improve the best existing rate from $O((d-\bar{r})/k)$ to $O(\min(1/\sqrt[4]{k}, \sqrt{(d-\bar{r})}/k, \sqrt{T\bar{r}}/k))$, where $d$ is the dimensionality and $\bar{r}$ the average task rank. Furthermore, we establish the first rate for random task orderings without replacement. The resulting rate $O(\min(1/\sqrt[4]{T},\, (d-\bar{r})/T))$ shows that randomization alone, without task repetition, prevents catastrophic forgetting in sufficiently long task sequences. Finally, we prove a matching $O(1/\sqrt[4]{k})$ forgetting rate for continual linear classification on separable data. Our universal rates extend to broader methods, such as block Kaczmarz and POCS, illuminating their loss convergence under i.i.d. and single-pass orderings.
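Below is a minimal, self-contained sketch of the setup described above, assuming synthetic jointly realizable least-squares tasks; the dimensions `d`, `T`, `r`, the i.i.d. with-replacement ordering, and the helper names `fit_task` and `forgetting` are illustrative choices, not the paper's experiments. It shows the basic mechanism: each task is fitted exactly by a minimum-norm update (a projection onto the task's solution set, as in block Kaczmarz), and forgetting is measured as the average loss over all tasks seen so far after $k$ iterations.

```python
import numpy as np

rng = np.random.default_rng(0)

d, T, r = 100, 20, 5          # dimension, number of tasks, rank per task (illustrative)
w_star = rng.normal(size=d)   # shared solution, so the tasks are jointly realizable

# Each task m is a realizable least-squares problem (X_m, y_m) with y_m = X_m @ w_star.
tasks = []
for _ in range(T):
    X = rng.normal(size=(r, d))
    tasks.append((X, X @ w_star))

def fit_task(w, X, y):
    """Fit (X, y) exactly via the minimum-norm update: a projection of w onto the
    task's solution set, i.e., a block Kaczmarz step."""
    return w + np.linalg.pinv(X) @ (y - X @ w)

def forgetting(w, seen):
    """Average squared loss at the current iterate over the tasks seen so far."""
    return np.mean([np.mean((X @ w - y) ** 2) for X, y in seen])

w = np.zeros(d)
k = 200                        # iterations; tasks drawn i.i.d. with replacement
seen = []
for t in range(k):
    X, y = tasks[rng.integers(T)]
    w = fit_task(w, X, y)
    seen.append((X, y))
    if (t + 1) % 50 == 0:
        print(f"iteration {t + 1:4d}   forgetting {forgetting(w, seen):.3e}")
```

Swapping the i.i.d. draw for a random permutation of the $T$ tasks gives the without-replacement (single-pass) ordering discussed in the abstract.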