Large Language Models (LLMs) suffer severe catastrophic forgetting when adapted sequentially to new tasks in a continual learning (CL) setting. Existing approaches are fundamentally limited: replay-based methods are often impractical and raise privacy concerns, while strict orthogonality-based methods collapse under scale, since each new task is projected onto the orthogonal complement of all previous updates, progressively exhausting the residual degrees of freedom and eliminating forward transfer by forbidding any overlap in shared representations. In this work, we introduce ELLA, a training framework built on the principle of selective subspace de-correlation. Rather than forbidding all overlap, ELLA explicitly characterizes the structure of past updates and penalizes alignment along their high-energy, task-specific directions, while preserving freedom in the low-energy residual subspaces to enable transfer. Formally, this is realized via a lightweight regularizer on a single aggregated update matrix. We prove this mechanism corresponds to an anisotropic shrinkage operator that bounds interference, yielding a penalty whose memory and compute costs are constant regardless of task sequence length. ELLA requires no data replay, no architectural expansion, and negligible storage. Empirically, it achieves state-of-the-art CL performance on three popular benchmarks, with relative accuracy gains of up to $9.6\%$ and a $35\times$ smaller memory footprint. Further, ELLA scales robustly across architectures and actively enhances the model's zero-shot generalization on unseen tasks, establishing a principled and scalable solution for constructive lifelong LLM adaptation.
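To make the core idea concrete, the sketch below illustrates selective subspace de-correlation under stated assumptions: the paper does not specify its exact formulation, so the function name `ella_penalty`, the energy-fraction cutoff, and the energy-weighted quadratic form are all illustrative choices. The sketch decomposes an aggregated past-update matrix via SVD, retains only its high-energy directions, and penalizes a candidate update's alignment with them, leaving low-energy residual directions unpenalized.

```python
import numpy as np

def ella_penalty(delta_w, past_update, energy_frac=0.9):
    """Illustrative anisotropic penalty (not the paper's exact regularizer).

    delta_w     : candidate weight update, shape (d, n)
    past_update : single aggregated matrix of past updates, shape (d, m)
    energy_frac : fraction of spectral energy defining "high-energy" directions
    """
    # Decompose the aggregated past-update matrix.
    U, s, _ = np.linalg.svd(past_update, full_matrices=False)

    # Keep the smallest leading set of directions capturing energy_frac
    # of the total spectral energy; these are the task-specific directions.
    energy = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(energy, energy_frac) + 1)
    U_k, s_k = U[:, :k], s[:k]

    # Anisotropic shrinkage: alignment with each retained direction is
    # penalized in proportion to that direction's normalized energy.
    # Directions outside U_k (the low-energy residual subspace) are free.
    weights = (s_k**2) / np.sum(s_k**2)
    proj = U_k.T @ delta_w  # (k, n) components along high-energy directions
    return float(np.sum(weights[:, None] * proj**2))
```

Because only `U_k` and `s_k` of the single aggregated matrix are needed, the storage and compute of this penalty stay constant as the task sequence grows, matching the memory- and compute-constant property the abstract claims.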