Fine-tuning language models~(LMs) on human-generated data remains a prevalent practice. However, the performance of such models is often limited by the quantity and diversity of high-quality human data. In this paper, we explore whether we can go beyond human data on tasks where we have access to scalar feedback, for example, on math problems where one can verify correctness. To do so, we investigate a simple self-training method based on expectation-maximization, which we call ReST$^{EM}$, where we (1) generate samples from the model and filter them using binary feedback, (2) fine-tune the model on these samples, and (3) repeat this process a few times. Testing on advanced MATH reasoning and APPS coding benchmarks using PaLM-2 models, we find that ReST$^{EM}$ scales favorably with model size and significantly surpasses fine-tuning only on human data. Overall, our findings suggest self-training with feedback can substantially reduce dependence on human-generated data.
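To make the three steps concrete, the following is a minimal toy sketch of the generate, filter, fine-tune loop, not the implementation used in the experiments. It rests on several assumptions that do not come from the abstract: the "model" is a table of answer probabilities per problem, the binary feedback is an exact match against a reference answer, and "fine-tuning" is a simple interpolation toward the empirical distribution of the kept samples. The names `generate`, `filter_correct`, `fine_tune`, `rest_em`, and the `step` parameter are hypothetical.

```python
import random
from collections import Counter


def generate(policy, problem, n, rng):
    """Step (1a): sample n candidate answers for one problem from the toy policy."""
    answers = list(policy[problem])
    weights = [policy[problem][a] for a in answers]
    return rng.choices(answers, weights=weights, k=n)


def filter_correct(samples, reference):
    """Step (1b): binary feedback, assumed here to be exact match with a reference."""
    return [s for s in samples if s == reference]


def fine_tune(policy, kept, step=0.5):
    """Step (2): toy stand-in for fine-tuning (an assumption, not gradient training).

    Each problem's distribution is moved toward the empirical distribution of
    its kept samples. Problems with no kept samples are left unchanged.
    """
    new_policy = {}
    for problem, dist in policy.items():
        samples = kept.get(problem, [])
        if not samples:
            new_policy[problem] = dict(dist)
            continue
        counts = Counter(samples)
        new_policy[problem] = {
            a: (1 - step) * p + step * counts[a] / len(samples)
            for a, p in dist.items()
        }
    return new_policy


def rest_em(policy, references, iterations=3, n_samples=32, seed=0):
    """Step (3): repeat generate -> filter -> fine-tune a few times."""
    rng = random.Random(seed)
    history = [policy]
    for _ in range(iterations):
        kept = {
            problem: filter_correct(generate(policy, problem, n_samples, rng), ref)
            for problem, ref in references.items()
        }
        policy = fine_tune(policy, kept)
        history.append(policy)
    return history


# Hypothetical toy data: two "problems" with a reference answer each.
references = {"2+2": "4", "3*3": "9"}
initial = {
    "2+2": {"3": 0.4, "4": 0.2, "5": 0.4},
    "3*3": {"6": 0.5, "9": 0.25, "12": 0.25},
}
history = rest_em(initial, references)
for i, pol in enumerate(history):
    print(i, {p: round(pol[p][ref], 3) for p, ref in references.items()})
```

In this toy setting, only correct samples survive the filter, so the probability assigned to the reference answer cannot decrease from one iteration to the next. With an actual language model, step (2) would be supervised fine-tuning on the filtered samples, and no such guarantee is implied.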