One of the biggest missing capabilities in current AI systems is the ability to learn continuously after deployment. Implementing such continually learning systems poses several challenges, one of which is the large memory requirement of the gradient-based algorithms used to train state-of-the-art LLMs. Evolutionary Strategies (ES) have recently re-emerged as a gradient-free alternative to traditional learning algorithms and have shown encouraging performance on specific LLM tasks. In this paper, we perform a comprehensive analysis of ES and specifically evaluate its forgetting curves when training for an increasing number of update steps. We first find that ES is able to reach performance close to GRPO on math and reasoning tasks with a comparable compute budget. However, and most importantly for continual learning, the performance gains of ES are accompanied by significant forgetting of prior abilities, limiting its applicability for training models online. We also explore the reason behind this behavior and show that the updates made by ES are much less sparse and have orders-of-magnitude larger $\ell_2$ norm than the corresponding GRPO updates, explaining the contrasting forgetting curves between the two algorithms. With this study, we aim to highlight the issue of forgetting in gradient-free algorithms like ES and hope to inspire future work to mitigate these issues.
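To make the two quantities in the analysis concrete, the following is a minimal, self-contained sketch of (i) a single Gaussian-perturbation ES step on a toy objective and (ii) the update statistics the abstract refers to: the sparsity (fraction of essentially unchanged coordinates) and the $\ell_2$ norm of a parameter update. The hyperparameters, the `reward_fn`, and the toy objective are illustrative assumptions, not the paper's actual training setup.

```python
import numpy as np

def es_update(theta, reward_fn, sigma=0.02, alpha=0.01, population=8, rng=None):
    """One Gaussian-perturbation ES step (illustrative sketch, not the paper's exact setup).

    Samples noise directions, scores each perturbed parameter vector with
    `reward_fn`, and moves theta along the reward-weighted average direction.
    """
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal((population, theta.size))
    rewards = np.array([reward_fn(theta + sigma * e) for e in eps])
    # Standardize rewards so the step size does not depend on reward scale.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    grad_est = (adv[:, None] * eps).mean(axis=0) / sigma
    return theta + alpha * grad_est

def update_stats(theta_before, theta_after, tol=1e-8):
    """Sparsity (fraction of ~unchanged coordinates) and l2 norm of an update."""
    delta = theta_after - theta_before
    sparsity = float(np.mean(np.abs(delta) <= tol))
    return sparsity, float(np.linalg.norm(delta))

# Toy objective: reward is negative distance to a target vector.
target = np.ones(1000)
theta0 = np.zeros(1000)
theta1 = es_update(theta0, lambda t: -np.linalg.norm(t - target))
sparsity, l2 = update_stats(theta0, theta1)
```

Because the Gaussian perturbations are dense, a single ES step moves essentially every coordinate, so the measured sparsity is near zero; this is the qualitative contrast with gradient-based updates that the analysis in the paper quantifies.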