KeepLoRA++: Continual Learning with Layer-Scaled Residual Gradient Adaptation

Continual learning for pre-trained vision-language models requires balancing three competing objectives: retaining pre-trained knowledge, preserving knowledge from a sequence of learned tasks, and maintaining the plasticity to acquire new knowledge. This paper presents KeepLoRA++, which balances these objectives through a unified dual-dimensional knowledge retention mechanism. We analyze how knowledge is distributed in the Transformer architecture from both inter-layer and intra-layer perspectives. The inter-layer perspective examines how retained knowledge is distributed across layers, while the intra-layer perspective focuses on the parameter space within each layer. Our analysis reveals a structural property: general, transferable knowledge is mainly encoded in the shallow layers and in the principal subspace of the parameters, while task-specific adaptations are localized in the deep layers and in the residual subspace. Motivated by this insight, KeepLoRA++ introduces a layer-scaled residual gradient adaptation method. New tasks are learned by restricting LoRA parameter updates to the residual subspace and applying a shallow-to-deep layer scaling, which prevents interference with previously acquired capabilities. Specifically, the gradient of a new task is projected onto a subspace orthogonal to both the principal subspace of the pre-trained model and the dominant directions of previous task features, and the projected update is assigned a smaller magnitude in shallow layers and a larger one in deeper layers. Our theoretical analysis and empirical evaluations confirm that KeepLoRA++ balances these three competing objectives, consistently outperforming representative baselines across image classification, visual question answering, and video understanding tasks.
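
The projection and scaling steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions made for illustration: the protected directions are taken in the input space of a single weight matrix (top right singular vectors of the pre-trained weight and of a matrix of previous-task input features), the gradient is a full-weight gradient rather than a LoRA factor gradient, the layer scale is a linear ramp from `lo` to `hi`, and all function names and the values of `k_weight`, `k_feat`, `lo`, and `hi` are hypothetical.

```python
import numpy as np


def top_directions(matrix, k):
    """Top-k right singular vectors of `matrix`, returned as columns (d_in, k)."""
    _, _, vt = np.linalg.svd(matrix, full_matrices=False)
    return vt[:k].T


def merge_bases(*bases, tol=1e-10):
    """Orthonormal basis for the span of several column bases (assumed helper)."""
    stacked = np.concatenate(bases, axis=1)
    u, s, _ = np.linalg.svd(stacked, full_matrices=False)
    return u[:, s > tol]  # drop numerically dependent directions


def project_to_residual(grad, protected):
    """Remove the component of `grad` (d_out, d_in) that acts on span(protected)."""
    return grad - (grad @ protected) @ protected.T


def layer_scale(layer_idx, num_layers, lo=0.1, hi=1.0):
    """Assumed linear shallow-to-deep schedule: small for shallow, large for deep."""
    if num_layers <= 1:
        return hi
    return lo + (hi - lo) * layer_idx / (num_layers - 1)


def scaled_residual_gradient(grad, weight, prev_features, layer_idx, num_layers,
                             k_weight=2, k_feat=2):
    """Project the gradient onto the residual subspace, then apply the layer scale."""
    protected = merge_bases(
        top_directions(weight, k_weight),        # principal subspace of pre-trained weight
        top_directions(prev_features, k_feat),   # dominant previous-task feature directions
    )
    update = layer_scale(layer_idx, num_layers) * project_to_residual(grad, protected)
    return update, protected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weight = rng.standard_normal((6, 8))          # toy pre-trained weight (d_out, d_in)
    prev_features = rng.standard_normal((20, 8))  # toy previous-task inputs (n, d_in)
    grad = rng.standard_normal((6, 8))            # toy new-task gradient
    update, protected = scaled_residual_gradient(grad, weight, prev_features, 0, 4)
    # The update should leave outputs unchanged for inputs in the protected span.
    print(np.abs(update @ protected).max())
```

Under these assumptions, the projected update maps every protected input direction to approximately zero, so the layer's responses along those directions are unchanged, while the layer scale only shrinks or enlarges the step that remains in the residual subspace.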
