Harnessing the power of human-annotated data through Supervised Fine-Tuning (SFT) is pivotal for advancing Large Language Models (LLMs). In this paper, we delve into the prospect of growing a strong LLM out of a weak one without the need for acquiring additional human-annotated data. We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN), which starts from a supervised fine-tuned model. At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself. More specifically, the LLM generates its own training data from its previous iterations, refining its policy by discerning these self-generated responses from those obtained from human-annotated data. Our method progressively elevates the LLM from a nascent model to a formidable one, unlocking the full potential of human-annotated demonstration data for SFT. Theoretically, we prove that the global optimum to the training objective function of our method is achieved only when the LLM policy aligns with the target data distribution. Empirically, we evaluate our method on several benchmark datasets including the HuggingFace Open LLM Leaderboard, MT-Bench, and datasets from Big-Bench. Our results show that SPIN can significantly improve the LLM's performance across a variety of benchmarks and even outperform models trained through direct preference optimization (DPO) supplemented with extra GPT-4 preference data. This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
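To make the self-play idea concrete, the following is a minimal sketch of a per-example objective that rewards the current policy for separating a human-annotated response from a response generated by its previous iteration. The abstract does not state the loss, so the form used here is an assumption for illustration: a logistic loss on a margin of log-probability ratios. The function names and the scaling parameter `lam` are likewise hypothetical.

```python
import math


def logistic_loss(t: float) -> float:
    """Numerically stable log(1 + exp(-t))."""
    if t >= 0:
        return math.log1p(math.exp(-t))
    return -t + math.log1p(math.exp(t))


def self_play_loss(
    logp_new_human: float,
    logp_old_human: float,
    logp_new_synth: float,
    logp_old_synth: float,
    lam: float = 0.1,
) -> float:
    """Illustrative self-play loss for a single prompt (assumed form).

    logp_*_human: log-probability of the human-annotated response under the
        new (trainable) policy and the old (frozen, previous-iteration) policy.
    logp_*_synth: log-probability of the response the old policy generated
        for the same prompt, under the new and old policies.
    lam: hypothetical scaling factor on the margin.
    """
    # How much the new policy has raised the human response relative to the
    # old policy, minus how much it has raised the self-generated response.
    margin = lam * (
        (logp_new_human - logp_old_human)
        - (logp_new_synth - logp_old_synth)
    )
    # The loss decreases as the margin grows.
    return logistic_loss(margin)
```

Under this sketch, a policy identical to its previous iteration has a margin of zero, so the loss equals `math.log(2)`. The loss falls below that value only when the new policy shifts probability toward the human-annotated response and away from the self-generated one.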