Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

Harnessing the power of human-annotated data through Supervised Fine-Tuning (SFT) is pivotal for advancing Large Language Models (LLMs). In this paper, we delve into the prospect of growing a strong LLM out of a weak one without the need for acquiring additional human-annotated data. We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN), which starts from a supervised fine-tuned model. At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself. More specifically, the LLM generates its own training data from its previous iterations and refines its policy by learning to distinguish these self-generated responses from those in the human-annotated data. Our method progressively elevates the LLM from a nascent model to a formidable one, unlocking the full potential of human-annotated demonstration data for SFT. Theoretically, we prove that the global optimum of our method's training objective is achieved only when the LLM policy aligns with the target data distribution. Empirically, we evaluate our method on several benchmark datasets, including the HuggingFace Open LLM Leaderboard, MT-Bench, and datasets from Big-Bench. Our results show that SPIN can significantly improve the LLM's performance across a variety of benchmarks and even outperform models trained through direct preference optimization (DPO) supplemented with extra GPT-4 preference data. This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents. Code is available at https://github.com/uclaml/SPIN.
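To make the self-play mechanism concrete, the following is a minimal sketch of a per-example loss that rewards the current policy for separating a human-annotated response from a response generated by its previous iteration. The abstract does not state the objective, so the logistic loss, the scaling factor `lam`, and all function and parameter names here are assumptions for illustration, not the authors' implementation. The inputs are sequence log-probabilities that would, in practice, come from the current model and a frozen copy of the previous iteration.

```python
import math


def logistic_loss(t: float) -> float:
    """Numerically stable log(1 + exp(-t))."""
    if t >= 0:
        return math.log1p(math.exp(-t))
    return -t + math.log1p(math.exp(t))


def spin_pair_loss(
    logp_new_human: float,  # log p_new(y_human | x), current policy (hypothetical input)
    logp_old_human: float,  # log p_old(y_human | x), previous iteration
    logp_new_self: float,   # log p_new(y_self | x), y_self sampled from the previous iteration
    logp_old_self: float,   # log p_old(y_self | x)
    lam: float = 0.1,       # assumed scaling factor, not given in the abstract
) -> float:
    """Loss for one (prompt, human response, self-generated response) triple.

    The margin grows when the current policy raises the likelihood of the
    human response and lowers that of the self-generated one, both measured
    relative to the previous iteration.
    """
    human_gain = logp_new_human - logp_old_human
    self_gain = logp_new_self - logp_old_self
    return logistic_loss(lam * (human_gain - self_gain))


# Unchanged policy: the margin is zero, so the loss equals log(2).
baseline = spin_pair_loss(-12.0, -12.0, -9.0, -9.0)

# Policy that now prefers the human response: the loss falls below the baseline.
improved = spin_pair_loss(-10.0, -12.0, -11.0, -9.0)
```

Under these assumptions, an outer loop would sample fresh responses from the latest model, minimize this loss over the SFT dataset, and repeat. Once the self-generated responses are indistinguishable from the human-annotated ones there is no margin left to gain, which is consistent with the abstract's claim about where the global optimum lies.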


