Context Channel Capacity: An Information-Theoretic Framework for Understanding Catastrophic Forgetting

Catastrophic forgetting remains a central challenge in continual learning (CL), yet the field lacks a unified information-theoretic explanation for why some architectures forget catastrophically while others do not. We introduce \emph{Context Channel Capacity} ($C_\mathrm{ctx}$), the mutual information between a CL architecture's context signal and its generated parameters, and prove that zero forgetting requires $C_\mathrm{ctx} \geq H(T)$, where $H(T)$ is the entropy of the task identity. We establish an \emph{Impossibility Triangle} -- zero forgetting, online learning, and finite parameters cannot be simultaneously satisfied by sequential state-based learners -- and show that conditional regeneration architectures (HyperNetworks) bypass this triangle by redefining parameters as function values rather than states. We validate this framework across 8 CL methods on Split-MNIST (1,130+ experiments over 86 days, 4 seeds each), showing that $C_\mathrm{ctx}$ perfectly predicts forgetting behavior: methods with $C_\mathrm{ctx} = 0$ (NaiveSGD, EWC, SI, LwF, CFlow) exhibit catastrophic forgetting (6--97\%), while methods with $C_\mathrm{ctx} \approx 1$ (HyperNetwork) achieve zero forgetting (98.8\% ACC). We further propose \emph{Wrong-Context Probing} (P5), a practical diagnostic protocol for measuring $C_\mathrm{ctx}$, and extend the framework to CIFAR-10 via a novel \emph{Gradient Context Encoder} that closes the oracle gap from 23.3pp to 0.7pp. A systematic taxonomy of 15+ closed research directions -- including the Hebbian null result (frozen random features outperform learned features), CFlow's $\theta_0$-memorizer phenomenon, and the $S_N$ symmetry barrier to column specialization -- provides the community with precisely diagnosed negative results. Our central design principle is \emph{architecture over algorithm}: the context pathway must be structurally unbypassable.
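
To make the condition $C_\mathrm{ctx} \geq H(T)$ concrete, the following is a minimal sketch, not the estimator or protocol used in the experiments. It assumes a toy setting with discrete task identities used directly as the context signal and a discrete identifier standing in for each generated parameter set, and it measures everything in bits with a plug-in estimator. The function names (`entropy_bits`, `mutual_information_bits`) and all data values are hypothetical.

```python
import math
from collections import Counter


def entropy_bits(labels):
    """Shannon entropy (in bits) of the empirical distribution of `labels`."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information_bits(xs, ys):
    """Plug-in estimate of I(X; Y) = H(X) + H(Y) - H(X, Y), in bits."""
    joint = list(zip(xs, ys))
    return entropy_bits(xs) + entropy_bits(ys) - entropy_bits(joint)


# Hypothetical toy data: 4 equiprobable tasks, each observed twice.
tasks = [0, 1, 2, 3, 0, 1, 2, 3]

# Assumption: a context-conditioned generator emits a distinct
# parameter set (represented here by an integer id) for each task.
params_conditioned = [10, 11, 12, 13, 10, 11, 12, 13]

# Assumption: a sequential state-based learner holds one shared
# parameter state, so its parameters carry no information about the task.
params_shared = [7] * len(tasks)

h_t = entropy_bits(tasks)
c_conditioned = mutual_information_bits(tasks, params_conditioned)
c_shared = mutual_information_bits(tasks, params_shared)

print(f"H(T)                  = {h_t:.3f} bits")
print(f"C_ctx (conditioned)   = {c_conditioned:.3f} bits, meets bound: {c_conditioned >= h_t - 1e-9}")
print(f"C_ctx (shared state)  = {c_shared:.3f} bits, meets bound: {c_shared >= h_t - 1e-9}")
```

In this toy case the conditioned generator carries as much information as the task entropy, while the shared-state learner carries none. Real parameters are continuous and high-dimensional, so a practical measurement would need a different estimator or a behavioral proxy such as the Wrong-Context Probing protocol.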
