Fine-tuning pre-trained language models (LMs) enhances their capabilities. Prior techniques fine-tune a pre-trained LM on input-output pairs (e.g., instruction fine-tuning) or with numerical rewards that gauge the quality of its outputs (e.g., reinforcement learning from human feedback). We explore LMs' potential to learn from textual interactions (LeTI) that not only check the correctness of their outputs with binary labels but also pinpoint and explain errors in those outputs through textual feedback. Our investigation focuses on the code generation task, where the model produces pieces of code in response to natural language instructions. This setting invites a natural and scalable way to acquire textual feedback: the error messages and stack traces produced when the code is executed with a Python interpreter. LeTI iteratively fine-tunes the model, using the LM objective, on a concatenation of natural language instructions, LM-generated programs, and textual feedback, which is provided only when the generated program fails to solve the task. A binary reward token is prepended to this fine-tuning text to differentiate correct and buggy solutions. On MBPP, a code generation dataset, LeTI substantially improves the performance of two base LMs of different scales. LeTI requires no ground-truth outputs for training and even outperforms a fine-tuned baseline that does. LeTI's strong performance generalizes to other datasets: trained on MBPP, it achieves comparable or better performance than the base LMs on unseen problems in HumanEval. Furthermore, compared to binary feedback, we observe that textual feedback leads to improved generation quality and sample efficiency, achieving the same performance with fewer than half of the gradient steps. LeTI is equally applicable to natural language tasks when they can be formulated as code generation, which we empirically verify on event argument extraction.
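To make the data construction described above concrete, here is a minimal sketch of how one fine-tuning example might be assembled. It is an illustration under stated assumptions, not the authors' implementation: the reward-token strings, the function names `run_with_feedback` and `build_training_text`, the newline-joined layout, and the use of assert-style test code are all hypothetical choices made for this example.

```python
import traceback
from typing import Tuple

# Hypothetical reward-token strings; the abstract only says a binary
# reward token is prepended, not what it looks like.
GOOD, BAD = "<|good|>", "<|bad|>"


def run_with_feedback(program: str, test_code: str) -> Tuple[bool, str]:
    """Execute a generated program and its tests.

    Returns (passed, feedback). The feedback is the interpreter's error
    message and stack trace, or an empty string if everything passed.
    NOTE: exec() on untrusted code is unsafe; a real pipeline would need
    a sandbox. This is for illustration only.
    """
    namespace: dict = {}
    try:
        exec(program, namespace)    # define the generated function(s)
        exec(test_code, namespace)  # run assert-style test cases
    except Exception:
        return False, traceback.format_exc()
    return True, ""


def build_training_text(instruction: str, program: str,
                        passed: bool, feedback: str) -> str:
    """Concatenate reward token, instruction, program, and feedback.

    Textual feedback is appended only when the program failed,
    mirroring the description in the abstract.
    """
    parts = [GOOD if passed else BAD, instruction, program]
    if not passed:
        parts.append(feedback)
    return "\n".join(parts)


# Example: a buggy solution that subtracts instead of adding.
instruction = "Write a function add(a, b) that returns the sum of a and b."
buggy_program = "def add(a, b):\n    return a - b"
tests = "assert add(2, 3) == 5"

passed, feedback = run_with_feedback(buggy_program, tests)
print(build_training_text(instruction, buggy_program, passed, feedback))
```

In this sketch the resulting string would be used as ordinary LM training text, so no ground-truth solution is needed: the only supervision comes from the test outcome and the interpreter's output.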