Learning, Fast and Slow: Towards LLMs That Adapt Continually

Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via reinforcement learning, RL). However, parameter updates force the model to absorb task-specific information into its weights, which can cause catastrophic forgetting and loss of plasticity. In contrast, in-context learning with fixed LLM parameters (e.g., prompt optimization) adapts to task-specific requirements cheaply and rapidly, but on its own typically cannot match the performance gains available through parameter updates. There is no good reason to restrict learning to either in-context or in-weights alone. Moreover, humans likely also learn at different time scales (e.g., System 1 vs. System 2). To this end, we introduce a fast-slow learning framework for LLMs, with model parameters as "slow" weights and optimized context as "fast" weights. The fast weights learn from textual feedback to absorb task-specific information, which allows the slow weights to stay closer to the base model and preserve general reasoning behaviors. Across reasoning tasks, Fast-Slow Training (FST) is up to 3x more sample-efficient than slow learning alone (RL), while consistently reaching a higher performance asymptote. Moreover, FST-trained models remain closer to the base LLM (up to 70% less KL divergence), resulting in less catastrophic forgetting than RL training. This reduced drift also preserves plasticity: after training on one task, FST-trained models adapt more effectively to a subsequent task than models trained through parameter updates alone. In continual learning scenarios, where task domains change on the fly, FST continues to acquire each new task while parameter-only RL stalls.
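To make the "closer to the base LLM" claim concrete, the sketch below computes a KL divergence between a base model's next-token distribution and that of a trained model. It is a toy illustration only: the abstract does not state the direction of the KL term or how it is estimated, so the choice of KL(trained || base), the function name `kl_divergence`, and the three-token distributions are all assumptions made for this example.

```python
import math


def kl_divergence(p, q):
    """Return KL(p || q) in nats for two categorical distributions.

    Both arguments are sequences of probabilities over the same support.
    The direction KL(trained || base) used below is an assumption; the
    abstract does not specify it.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same support size")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # terms with p_i = 0 contribute nothing
        if q_i == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_i * math.log(p_i / q_i)
    return total


# Hypothetical next-token distributions over a three-token vocabulary.
base_model = [0.50, 0.30, 0.20]
small_drift = [0.45, 0.35, 0.20]  # stays near the base distribution
large_drift = [0.10, 0.10, 0.80]  # moves far from the base distribution

kl_small = kl_divergence(small_drift, base_model)
kl_large = kl_divergence(large_drift, base_model)
print(f"KL(small drift || base) = {kl_small:.4f}")
print(f"KL(large drift || base) = {kl_large:.4f}")
```

In this toy setting, a model whose outputs stay near the base distribution yields a much smaller divergence than one that has shifted most of its probability mass, which is the sense in which a lower KL indicates less drift from the base model.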