Continual learning agents with finite capacity must balance acquiring new knowledge with retaining the old. This requires controlled forgetting of knowledge that is no longer needed, freeing up capacity to learn. Weight decay, viewed as a mechanism for forgetting, can serve this role by gradually discarding information stored in the weights. However, a fixed scalar weight decay drives this forgetting uniformly over time and uniformly across all parameters, even when some encode stable knowledge while others track rapidly changing targets. We introduce Forgetting through Adaptive Decay (FADE), which adapts per-parameter weight decay rates online via approximate meta-gradient descent. We derive FADE for the online linear setting and apply it to the final layer of neural networks. Our empirical analysis shows that FADE automatically discovers distinct decay rates for different parameters, complements step-size adaptation, and consistently improves over fixed weight decay across online tracking and streaming classification problems.
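To make the idea of per-parameter decay adapted by an approximate meta-gradient concrete, the following is a minimal sketch for the online linear setting. It is not the FADE update as derived in the paper, whose exact form is not given here. It is an IDBD-style construction under stated assumptions: each decay rate is parameterized as lambda_i = exp(beta_i), the meta-objective is the squared one-step prediction error, and the sensitivity of each weight to its own log decay rate is tracked by a trace h_i, ignoring cross-parameter effects. The names `adaptive_decay_step` and `run_stream`, the clipping range, the step sizes, and the synthetic drifting-target task are all hypothetical choices for illustration.

```python
import numpy as np


def adaptive_decay_step(w, beta, h, x, y, alpha=0.05, theta=0.01):
    """One online update of a linear predictor with per-weight decay rates.

    Assumed (illustrative) base update:
        w_i <- (1 - alpha * lambda_i) * w_i + alpha * delta * x_i,
    with lambda_i = exp(beta_i) and delta = y - w.x.
    """
    delta = y - w @ x  # prediction error before the update

    # Meta-gradient descent on 0.5 * delta**2 with respect to beta_i,
    # using h_i as an approximation of d w_i / d beta_i.
    # Clipping keeps lambda_i in [1e-6, 1]; the range is an assumption.
    beta = np.clip(beta + theta * delta * x * h, np.log(1e-6), 0.0)
    lam = np.exp(beta)

    # Trace recursion obtained by differentiating the base update with
    # respect to beta_i; the positive-part bound follows IDBD's convention.
    h_new = h * np.maximum(0.0, 1.0 - alpha * lam - alpha * x * x) - alpha * lam * w

    # Base update: decay each weight at its own rate, then take an LMS step.
    w_new = (1.0 - alpha * lam) * w + alpha * delta * x
    return w_new, beta, h_new, delta


def run_stream(steps=2000, theta=0.01, seed=0, dim=4, lam_init=0.01):
    """Run the sketch on a synthetic stream; returns (weights, decay rates)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    beta = np.full(dim, np.log(lam_init))
    h = np.zeros(dim)
    target = rng.normal(size=dim)
    for _ in range(steps):
        # Hypothetical task: the first half of the target weights drift as a
        # random walk, while the second half stays fixed.
        target[: dim // 2] += 0.05 * rng.normal(size=dim // 2)
        x = rng.normal(size=dim)
        y = target @ x + 0.1 * rng.normal()
        w, beta, h, _ = adaptive_decay_step(w, beta, h, x, y, alpha=0.05, theta=theta)
    return w, np.exp(beta)


if __name__ == "__main__":
    weights, decay_rates = run_stream()
    print("per-parameter decay rates:", decay_rates)
```

Setting `theta=0.0` disables the meta-update and recovers a fixed, uniform weight decay, which gives a baseline for comparison on the same stream.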