We provide a statistical analysis of regularization-based continual learning on a sequence of linear regression tasks, with emphasis on how different regularization terms affect model performance. We first derive the convergence rate for the oracle estimator obtained as if all data were available simultaneously. Next, we consider a family of generalized $\ell_2$-regularization algorithms indexed by matrix-valued hyperparameters, which includes the minimum norm estimator and continual ridge regression as special cases. As more tasks are introduced, we derive an iterative update formula for the estimation error of generalized $\ell_2$-regularized estimators, from which we determine the hyperparameters that yield the optimal algorithm. Interestingly, the choice of hyperparameters can effectively balance the trade-off between forward and backward knowledge transfer and adjust for data heterogeneity. Moreover, we derive the estimation error of the optimal algorithm explicitly and show that it is of the same order as that of the oracle estimator. In contrast, our lower bounds for the minimum norm estimator and continual ridge regression establish their suboptimality. A byproduct of our theoretical analysis is the equivalence between early stopping and generalized $\ell_2$-regularization in continual learning, which may be of independent interest. Finally, we conduct experiments to complement our theory.
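To make the estimator family concrete, here is a minimal sketch in illustrative notation (the data matrices $X_t$, responses $y_t$, matrix hyperparameters $\Lambda_t$, and iterates $\hat\beta_t$ are our assumed symbols, not necessarily the paper's): after observing task $t$, a generalized $\ell_2$-regularized update penalizes deviation from the previous iterate,
\[
\hat\beta_t \;=\; \arg\min_{\beta}\; \bigl\|y_t - X_t\beta\bigr\|_2^2 \;+\; (\beta - \hat\beta_{t-1})^\top \Lambda_t (\beta - \hat\beta_{t-1}),
\qquad
\hat\beta_t \;=\; \bigl(X_t^\top X_t + \Lambda_t\bigr)^{-1}\bigl(X_t^\top y_t + \Lambda_t \hat\beta_{t-1}\bigr),
\]
where $\Lambda_t \succeq 0$ is the matrix-valued hyperparameter and the closed form assumes $X_t^\top X_t + \Lambda_t$ is invertible. Setting $\Lambda_t = \lambda I$ recovers continual ridge regression, while letting $\Lambda_t \to 0^+$ (selecting the solution closest to $\hat\beta_{t-1}$ in the overparameterized case) recovers the minimum norm estimator; the optimal algorithm instead tunes the full matrix $\Lambda_t$.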