Large language models achieve remarkable performance across diverse tasks through the pre-training and fine-tuning paradigm. However, continual fine-tuning on sequential tasks induces catastrophic forgetting, in which newly acquired knowledge interferes with previously learned capabilities. Although this phenomenon is widely observed, its mechanisms remain poorly understood. Here, we present a comprehensive mechanistic analysis of catastrophic forgetting in transformer-based LLMs during sequential fine-tuning. Through systematic experiments across multiple model scales (109B to 400B total parameters) and task sequences, we identify three primary mechanisms driving forgetting: gradient interference in attention weights, representational drift in intermediate layers, and loss-landscape flattening. We demonstrate that forgetting severity correlates strongly with task similarity (Pearson r = 0.87) and with gradient-alignment metrics. Our analysis reveals that approximately 15 to 23 percent of attention heads undergo severe disruption during fine-tuning, with lower layers showing greater susceptibility. These findings establish a mechanistic foundation for developing targeted mitigation strategies in continual learning systems.
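The abstract leaves the gradient-alignment metric unspecified; a common way to operationalize it is the cosine similarity between the flattened gradients of two task losses, where values near 1 indicate compatible updates and negative values indicate the interference associated with forgetting. The PyTorch sketch below illustrates this reading; the function name `gradient_alignment` and the losses `loss_a` and `loss_b` are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def gradient_alignment(model: torch.nn.Module,
                       loss_a: torch.Tensor,
                       loss_b: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between the gradients of two task losses.

    A minimal sketch: compute each loss on the same model, take
    gradients with respect to all trainable parameters, flatten,
    and compare. Assumes both losses were computed on `model` and
    that every trainable parameter participates in both graphs.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    # retain_graph=True keeps any shared graph alive for the second call
    grads_a = torch.autograd.grad(loss_a, params, retain_graph=True)
    grads_b = torch.autograd.grad(loss_b, params, retain_graph=True)
    flat_a = torch.cat([g.flatten() for g in grads_a])
    flat_b = torch.cat([g.flatten() for g in grads_b])
    # Scalar in [-1, 1]; negative values signal gradient interference
    return F.cosine_similarity(flat_a, flat_b, dim=0)
```

Under this reading, the reported correlation between task similarity and forgetting severity would manifest as systematically lower alignment scores for dissimilar task pairs.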