Recently, self-play fine-tuning (SPIN) has been proposed to adapt large language models to downstream applications with scarce expert-annotated data by iteratively generating synthetic responses from the model itself. However, SPIN only optimizes the current reward advantage of annotated responses over the synthetic responses at hand, and this advantage may gradually vanish across iterations, destabilizing optimization. Moreover, its reliance on a reference policy induces a misalignment between the reward formulation used for training and the metric used for generation. To address these limitations, we propose a novel Triplet-based Self-Play fIne-tuNing (T-SPIN) method that integrates two key designs. First, beyond the current advantage, T-SPIN additionally incorporates the historical advantage of iteratively generated responses over the proto-synthetic responses produced by the initial policy. Even when the current advantage diminishes, the historical advantage remains informative, stabilizing the overall optimization. Second, T-SPIN introduces an entropy constraint into the self-play framework, which is theoretically justified to support reference-free fine-tuning and thereby eliminates the training-generation discrepancy. Empirical results on various tasks demonstrate not only the superior performance of T-SPIN over SPIN but also its stable evolution across iterations. Remarkably, compared to supervised fine-tuning, T-SPIN achieves comparable or even better performance with only 25% of the annotated samples, highlighting its effectiveness when annotated data are scarce.
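As a concrete illustration, the following is a minimal sketch of a triplet objective consistent with the description above; it is a reconstruction, not the paper's exact formulation. The logistic loss $\ell$, the scale $\beta$, and the symbols $y^{+}$ (annotated response), $y_{t}$ (synthetic response from the current iterate), and $y_{0}$ (proto-synthetic response from the initial policy) are illustrative assumptions:
$$
\mathcal{L}_{\text{T-SPIN}}(\theta) \;=\; \mathbb{E}_{x,\,y^{+},\,y_{t},\,y_{0}}\Big[\ell\big(r_{\theta}(y^{+},x)-r_{\theta}(y_{t},x)\big)\;+\;\ell\big(r_{\theta}(y_{t},x)-r_{\theta}(y_{0},x)\big)\Big],
\qquad r_{\theta}(y,x)\;=\;\beta\log\pi_{\theta}(y\mid x).
$$
Under this reading, the first term is SPIN's current advantage and the second is the historical advantage, which still supplies gradient signal once the first saturates. The reference-free reward $r_{\theta}=\beta\log\pi_{\theta}$ is what SPIN's reward $\beta\log(\pi_{\theta}/\pi_{\text{ref}})$ reduces to when the KL constraint toward $\pi_{\text{ref}}$ is replaced by an entropy constraint (a KL toward the uniform policy, since $\mathrm{KL}(\pi\,\|\,U)=-\mathcal{H}(\pi)+\text{const}$), so the training reward matches the log-likelihood metric used at generation time.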