Catastrophic forgetting in continual learning is often measured at the level of performance or last-layer representations, overlooking the underlying mechanisms. We introduce a mechanistic framework that interprets catastrophic forgetting geometrically, as the result of transformations applied to the encodings of individual features. These transformations can cause forgetting in two ways: by reducing the capacity allocated to a feature (degrading its representation) and by disrupting its readout by downstream computations. Analysis of a tractable model formalizes this view and allows us to identify best- and worst-case scenarios. Experiments on this model empirically validate the formal analysis and highlight the detrimental effect of network depth. Finally, we demonstrate how the framework can be applied to practical models via Crosscoders, presenting a case study of a Vision Transformer trained on sequential CIFAR-10. Our work provides a new, feature-centric vocabulary for continual learning.
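To make the geometric picture concrete, the following is a minimal NumPy sketch of the two failure modes named above; it is our illustration under simplified assumptions (a single scalar feature encoded along a unit direction v, Gaussian noise, and a frozen least-squares readout w), not the paper's tractable model. The names v, w, S, and the 0.9 shrink factor are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 2000                                 # representation dim, samples

# Illustrative setup: one scalar feature encoded along a unit direction v.
v = rng.normal(size=d)
v /= np.linalg.norm(v)
x = rng.normal(size=n)                          # ground-truth feature values
H = np.outer(x, v) + 0.1 * rng.normal(size=(n, d))  # noisy representations

# Downstream "readout": a least-squares linear decoder fit on the encoding.
w, *_ = np.linalg.lstsq(H, x, rcond=None)

def readout_mse(H_new):
    """Error of the *frozen* readout w on a transformed encoding."""
    return np.mean((H_new @ w - x) ** 2)

# (1) Rotate the encoding: the feature's information is preserved, but the
#     frozen readout no longer aligns with the feature direction.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))    # random orthogonal matrix
print(f"original encoding : {readout_mse(H):.3f}")
print(f"rotated encoding  : {readout_mse(H @ Q.T):.3f}")

# (2) Shrink the component along v: the capacity allocated to the feature
#     drops, so its signal-to-noise ratio (and hence readout) degrades.
S = np.eye(d) - 0.9 * np.outer(v, v)            # scales the v-component by 0.1
print(f"shrunk encoding   : {readout_mse(H @ S.T):.3f}")
```

In this toy setting the rotated encoding still contains the feature perfectly, yet the frozen readout fails; this is why the framework distinguishes disrupted readout from loss of allocated capacity.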