Language models achieve impressive results on tasks involving complex multistep reasoning, but scaling these capabilities further has traditionally required expensive collection of more annotated data. In this work, we explore the potential of improving the capabilities of language models without new data, using only automated feedback on the validity of their predictions in arithmetic reasoning (self-training). We find that models can improve substantially in both single-round (offline) and online self-training. In the offline setting, supervised methods deliver gains comparable to preference optimization, but in online self-training, preference optimization largely outperforms supervised training thanks to its superior stability and robustness on unseen types of problems.
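The self-training loop summarized above can be sketched as follows. This is a minimal illustration under assumptions not stated in the abstract: `sample_predictions` is a stand-in for sampling answers from the model, and the automated feedback is implemented here by evaluating the arithmetic expression directly; all names are hypothetical, not the authors' code.

```python
# Hedged sketch of self-training with automated validity feedback.
# The generator below is a stub standing in for a language model.
import random

def verify(problem: str, prediction: str) -> bool:
    # Automated feedback: compare the predicted answer against the
    # result of evaluating the arithmetic expression itself.
    try:
        return int(prediction) == eval(problem)
    except (ValueError, SyntaxError):
        return False

def sample_predictions(problem: str, k: int = 4) -> list[str]:
    # Stand-in for sampling k candidate answers from the model;
    # perturbs the true result so some samples are wrong.
    truth = eval(problem)
    return [str(truth + random.choice([-1, 0, 0, 1])) for _ in range(k)]

def build_training_signal(problems: list[str]):
    sft_examples, preference_pairs = [], []
    for p in problems:
        preds = sample_predictions(p)
        correct = [y for y in preds if verify(p, y)]
        wrong = [y for y in preds if not verify(p, y)]
        # Supervised self-training keeps only verified-correct outputs...
        sft_examples += [(p, y) for y in correct]
        # ...while preference optimization pairs a correct output
        # (chosen) against an incorrect one (rejected).
        preference_pairs += [(p, c, w) for c in correct for w in wrong]
    return sft_examples, preference_pairs

random.seed(0)
sft, pairs = build_training_signal(["2+3", "10*4"])
```

In the online variant, the model would be updated on this signal and new samples drawn from the updated model each round; the offline variant collects the signal once.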