Training on model-generated synthetic data is a promising approach for finetuning LLMs, but it remains unclear when it helps or hurts. In this paper, we investigate this question for math reasoning via an empirical study, followed by building a conceptual understanding of our observations. First, we find that while the typical approach of finetuning a model on synthetic correct or positive problem-solution pairs generated by capable models offers modest performance gains, sampling more correct solutions from the finetuned learner itself and then finetuning on this self-generated data $\textbf{doubles}$ the efficiency of the same synthetic problems. At the same time, training on model-generated positives can amplify various spurious correlations, resulting in flat or even inverse scaling trends as the amount of data increases. Surprisingly, we find that several of these issues can be addressed if we also utilize negative responses, i.e., model-generated responses that are deemed incorrect by a final-answer verifier. Crucially, these negatives must be constructed such that the training can appropriately recover the utility or advantage of each intermediate step in the negative response. With this per-step scheme, we attain consistent gains over positive data alone, achieving performance similar to amplifying the amount of synthetic data by $\mathbf{8\times}$. We show that training on per-step negatives can help to unlearn spurious correlations in the positive data, and is equivalent to advantage-weighted reinforcement learning (RL), implying that it inherits the robustness benefits of RL over imitating positive data alone.
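To make the equivalence in the last sentence concrete, a minimal sketch of a per-step advantage-weighted RL objective of the kind alluded to is given below; the notation is illustrative rather than taken from the paper, with $\pi_\theta$ the learner's policy, $\mathcal{D}$ the set of synthetic problem-solution pairs $(\mathbf{x}, \mathbf{y})$, and $A$ a per-step advantage derived from the final-answer verifier:

$$\max_{\theta}\;\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\mathcal{D}}\left[\sum_{i=1}^{|\mathbf{y}|} A\big(\mathbf{x},\mathbf{y}_{1:i-1},y_i\big)\,\log \pi_\theta\big(y_i \mid \mathbf{x},\mathbf{y}_{1:i-1}\big)\right]$$

Under this form, steps with positive advantage are reinforced while steps implicated in incorrect responses receive negative weight, which is how per-step negatives can unlearn spurious intermediate steps that plain imitation of positive data would retain.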