Fine-tuning large language models (LLMs) for downstream tasks is an essential stage of modern AI deployment. Reinforcement learning (RL) has emerged as the dominant fine-tuning paradigm, underpinning many state-of-the-art LLMs. In contrast, evolution strategies (ES) has largely been overlooked, owing to the widespread belief that it does not scale to modern model sizes. This paper overturns that assumption by demonstrating the first successful application of ES to full-parameter fine-tuning of LLMs at the billion-parameter scale, without dimensionality reduction. ES can indeed search extremely high-dimensional parameter spaces and outperform established RL implementations along multiple axes: greater tolerance to long-horizon and delayed rewards, robustness across diverse base LLMs, reduced susceptibility to reward hacking, and improved training stability. These findings suggest that ES is not merely a viable alternative to RL, but a fundamentally different and powerful backpropagation-free post-training paradigm that opens a new direction for LLM fine-tuning beyond current RL-based approaches. Source code is available at: https://github.com/VsonicV/es-fine-tuning-paper.
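To make the "backpropagation-free" claim concrete, the following is a minimal sketch of a vanilla ES update of the kind this line of work builds on: sample Gaussian perturbations of the parameters, score each perturbed copy with the reward function alone (no gradients), and recombine the perturbations weighted by normalized rewards. The toy quadratic reward, dimensionality, and all hyperparameters here are illustrative assumptions, not the paper's actual fine-tuning setup.

```python
import numpy as np

def es_step(theta, reward_fn, rng, npop=50, sigma=0.1, lr=0.02):
    """One evolution strategies update.

    Only forward evaluations of reward_fn are used; no gradient of
    reward_fn is ever computed, which is what makes ES backprop-free.
    """
    # Sample npop Gaussian perturbation directions in parameter space.
    noise = rng.standard_normal((npop, theta.size))
    # Evaluate the reward of each perturbed parameter vector.
    rewards = np.array([reward_fn(theta + sigma * eps) for eps in noise])
    # Normalize rewards so the update is invariant to reward scale/shift.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Estimated ascent direction: reward-weighted average of perturbations.
    grad_est = noise.T @ adv / (npop * sigma)
    return theta + lr * grad_est

# Toy stand-in for an LLM objective: maximize -||theta - target||^2
# in a 1000-dimensional parameter space (illustrative only).
target = np.ones(1000)
reward = lambda th: -np.sum((th - target) ** 2)

rng = np.random.default_rng(0)
theta = np.zeros(1000)
for _ in range(200):
    theta = es_step(theta, reward, rng)
```

Note that each iteration needs only `npop` forward passes, which is why ES parallelizes trivially across workers and sidesteps the memory cost of storing activations for backpropagation.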