Continual learning is a challenge for models with static architectures, as they fail to adapt when data distributions evolve across tasks. We introduce a mathematical framework that jointly models architecture and weights in a Sobolev space, enabling a rigorous investigation of the role of neural network architecture in continual learning and its effect on the forgetting loss. We derive necessary conditions for the continual learning solution and prove that learning only the model weights is insufficient to mitigate catastrophic forgetting under distribution shifts. We then prove that learning the architecture and weights simultaneously at each task reduces catastrophic forgetting. To learn weights and architecture simultaneously, we formulate continual learning as a bilevel optimization problem: the upper level selects an optimal architecture for a given task, while the lower level computes optimal weights via dynamic programming over all tasks. To solve the upper-level problem, we introduce a derivative-free direct search algorithm that determines the optimal architecture. Once the optimal architecture is found, knowledge must be transferred from the current architecture to it. However, the optimal architecture induces a weight parameter space different from that of the current architecture (i.e., the dimensions of the weight matrices do not match). To bridge this dimensionality gap, we develop a low-rank transfer mechanism that maps knowledge across architectures of mismatched dimensions. Empirical studies on regression and classification problems, including feedforward, convolutional, and graph neural networks, demonstrate that learning the optimal architecture and weights simultaneously yields substantially improved performance (by up to two orders of magnitude), reduced forgetting, and enhanced robustness to noise compared with static-architecture approaches.
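To make the bilevel structure concrete, one schematic way to write it, in our own notation rather than the paper's (here $a$ is an architecture from a search space $\mathcal{A}$, $w$ are weights, and $\mathcal{L}_s$ is the loss on task $s$):

$$
a_t^{\ast} \in \operatorname*{arg\,min}_{a \in \mathcal{A}} \; \mathcal{L}_t\bigl(a, w^{\ast}(a)\bigr),
\qquad
w^{\ast}(a) \in \operatorname*{arg\,min}_{w} \; \sum_{s=1}^{t} \mathcal{L}_s(a, w),
$$

where the lower-level sum over tasks $1,\dots,t$ stands in for the dynamic-programming computation of optimal weights over all tasks seen so far.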
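As one illustration of how a derivative-free direct search over architectures might operate, here is a minimal sketch in Python. It assumes a discrete search space parameterized by layer widths and a black-box objective `eval_loss`; both names, and the compass-search pattern itself, are our assumptions, since the abstract does not specify the algorithm's details:

```python
import itertools

def direct_search(widths, eval_loss, step=16, min_step=2):
    """Derivative-free compass search over layer widths.

    widths:    list[int], current architecture (one width per hidden layer)
    eval_loss: callable mapping a width tuple to a scalar task loss
    """
    best = list(widths)
    best_loss = eval_loss(tuple(best))
    while step >= min_step:
        improved = False
        # Poll each coordinate in both directions around the incumbent.
        for i, delta in itertools.product(range(len(best)), (+step, -step)):
            trial = list(best)
            trial[i] = max(1, trial[i] + delta)  # keep widths positive
            loss = eval_loss(tuple(trial))
            if loss < best_loss:                 # accept strictly improving moves
                best, best_loss, improved = trial, loss, True
        if not improved:
            step //= 2                           # shrink the poll radius and retry
    return best, best_loss
```

No gradients of the loss with respect to the architecture are needed, which is the defining property of direct search methods.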
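For the low-rank transfer across mismatched weight shapes, a minimal sketch assuming the mechanism resembles a truncated-SVD projection; the zero-padding/truncation of the singular factors below is our illustrative choice, not necessarily the paper's exact map:

```python
import numpy as np

def low_rank_transfer(W, out_shape, rank):
    """Map weights W (m x n) to a new (m' x n') matrix via a rank-r factorization.

    The top-r singular directions of W are resized to the target shape by
    truncating or zero-padding each factor, then recombined.
    """
    m2, n2 = out_shape
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    r = min(rank, len(S))
    U_r, S_r, Vt_r = U[:, :r], S[:r], Vt[:r, :]

    def resize(A, rows, cols):
        # Truncate or zero-pad A to (rows, cols).
        out = np.zeros((rows, cols))
        rr, cc = min(rows, A.shape[0]), min(cols, A.shape[1])
        out[:rr, :cc] = A[:rr, :cc]
        return out

    return resize(U_r, m2, r) @ np.diag(S_r) @ resize(Vt_r, r, n2)

# Example: transfer a 64x32 layer into an 80x48 layer at rank 8.
W_old = np.random.randn(64, 32)
W_new = low_rank_transfer(W_old, (80, 48), rank=8)
assert W_new.shape == (80, 48)
```

The point of the rank-r bottleneck is that knowledge is carried by a small number of dominant directions, so the map remains well defined even when neither weight dimension matches between the two architectures.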