Fine-tuning language models~(LMs) on human-generated data remains a prevalent practice. However, the performance of such models is often limited by the quantity and diversity of high-quality human data. In this paper, we explore whether we can go beyond human data on tasks where we have access to scalar feedback, for example, on math problems where one can verify correctness. To do so, we investigate a simple self-training method based on expectation-maximization, which we call ReST$^{EM}$, where we (1) generate samples from the model and filter them using binary feedback, (2) fine-tune the model on these samples, and (3) repeat this process a few times. Testing on advanced MATH reasoning and APPS coding benchmarks using PaLM-2 models, we find that ReST$^{EM}$ scales favorably with model size and significantly surpasses fine-tuning only on human data. Overall, our findings suggest self-training with feedback can substantially reduce dependence on human-generated data.
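The three steps above (generate and filter with binary feedback, fine-tune on the kept samples, repeat) can be sketched on a toy problem. This is a minimal illustration and not the authors' implementation: the "model" is assumed to be a table of answer probabilities per problem, "fine-tuning" is assumed to be a smoothed maximum-likelihood re-estimate from the filtered samples, and all names (`generate`, `binary_reward`, `fine_tune`, `self_train`, `accuracy`) and hyperparameters are hypothetical.

```python
import random
from collections import Counter


def generate(policy, problem, n, rng):
    """Sample n candidate answers for one problem from the current policy."""
    answers, weights = zip(*policy[problem].items())
    return rng.choices(answers, weights=weights, k=n)


def binary_reward(problem, answer, reference):
    """Binary feedback: 1 if the answer matches the verified reference, else 0."""
    return 1 if reference[problem] == answer else 0


def fine_tune(policy, kept, smoothing=1.0):
    """Toy stand-in for fine-tuning: re-estimate each answer distribution
    from the filtered samples, with additive smoothing (an assumption)."""
    new_policy = {}
    for problem, dist in policy.items():
        counts = Counter(kept.get(problem, []))
        if not counts:
            # No sample passed the filter, so keep the old distribution.
            new_policy[problem] = dict(dist)
            continue
        total = sum(counts.values()) + smoothing * len(dist)
        new_policy[problem] = {a: (counts[a] + smoothing) / total for a in dist}
    return new_policy


def self_train(policy, reference, iterations=3, samples_per_problem=32, seed=0):
    """Run the generate / filter / fine-tune loop for a few iterations."""
    rng = random.Random(seed)
    for _ in range(iterations):
        kept = {}
        for problem in policy:
            # (1) Generate samples and keep only those with reward 1.
            for answer in generate(policy, problem, samples_per_problem, rng):
                if binary_reward(problem, answer, reference):
                    kept.setdefault(problem, []).append(answer)
        # (2) Fine-tune on the filtered samples; (3) repeat.
        policy = fine_tune(policy, kept)
    return policy


def accuracy(policy, reference):
    """Probability of producing the correct answer, averaged over problems."""
    return sum(policy[p][reference[p]] for p in policy) / len(policy)


if __name__ == "__main__":
    reference = {"2+2": "4", "3*3": "9"}
    initial = {
        "2+2": {"3": 0.25, "4": 0.25, "5": 0.25, "22": 0.25},
        "3*3": {"6": 0.25, "9": 0.25, "12": 0.25, "33": 0.25},
    }
    final = self_train(initial, reference)
    print("before:", accuracy(initial, reference))
    print("after: ", accuracy(final, reference))
```

Because only verified-correct samples survive the filter, each round shifts probability mass toward correct answers in this toy setting. With a real LM the fine-tuning step would be gradient-based training on the filtered generations, and details such as the number of samples and iterations would need to be chosen empirically.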