Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

Harnessing the power of human-annotated data through Supervised Fine-Tuning (SFT) is pivotal for advancing Large Language Models (LLMs). In this paper, we explore the prospect of growing a strong LLM out of a weak one without acquiring additional human-annotated data. We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN), which starts from a supervised fine-tuned model. At the heart of SPIN lies a self-play mechanism, in which the LLM refines its capability by playing against instances of itself. More specifically, the LLM generates its own training data from its previous iterations and refines its policy by discerning these self-generated responses from those obtained from human-annotated data. Our method progressively elevates the LLM from a nascent model to a formidable one, unlocking the full potential of human-annotated demonstration data for SFT. Theoretically, we prove that the global optimum of the training objective function of our method is achieved only when the LLM policy aligns with the target data distribution. Empirically, we evaluate our method on several benchmark datasets, including the HuggingFace Open LLM Leaderboard, MT-Bench, and datasets from Big-Bench. Our results show that SPIN can significantly improve the LLM's performance across a variety of benchmarks and even outperform models trained through direct preference optimization (DPO) supplemented with extra GPT-4 preference data. This sheds light on the promise of self-play, which enables LLMs to achieve human-level performance without the need for expert opponents. Code is available at https://github.com/uclaml/SPIN.
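To make the self-play mechanism concrete, here is a minimal sketch of a per-example training loss. The abstract does not state the objective, so the form used here is an assumption: a logistic loss applied to the gap between two log-probability ratios, one for the human-annotated response and one for the response generated by the previous iteration. The function name `spin_logistic_loss`, its argument names, and the scale `lam` are hypothetical and chosen only for illustration.

```python
import math


def spin_logistic_loss(logp_new_human, logp_old_human,
                       logp_new_self, logp_old_self, lam=0.1):
    """Illustrative per-example loss (assumed form, not taken from the abstract).

    logp_new_*: log-probability of a response under the model being trained.
    logp_old_*: log-probability of the same response under the previous iteration.
    *_human: the human-annotated response; *_self: the self-generated response.
    lam: hypothetical scale on the margin.
    """
    # How much more the new model favors the human response than the old one did,
    # minus the same quantity for the self-generated response.
    margin = lam * ((logp_new_human - logp_old_human)
                    - (logp_new_self - logp_old_self))
    # Logistic loss log(1 + exp(-margin)), written to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Usage: the loss falls below log(2) once the new model shifts probability
# toward the human response and away from the self-generated one.
baseline = spin_logistic_loss(-5.0, -5.0, -7.0, -7.0)
improved = spin_logistic_loss(-4.0, -5.0, -8.0, -7.0)
print(baseline, improved)
```

Under this assumed form, a model identical to its previous iteration has zero margin, so the loss carries no preference for either response; training pushes the margin up by making human-annotated responses relatively more likely than the model's own earlier outputs.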
