Instruction tuning of open-source large language models (LLMs) such as LLaMA, using direct outputs from more powerful LLMs such as Instruct-GPT and GPT-4, has proven to be a cost-effective way to align model behavior with human preferences. However, an instruction-tuned model sees only one response per instruction and therefore lacks knowledge of potentially better responses. In this paper, we propose finetuning an instruction-tuned LLM with our novel \textit{probabilistic ranking} and \textit{contextual ranking} approaches to increase the likelihood of generating better responses. Probabilistic ranking enables the instruction-tuned model to inherit the teacher LLM's relative rankings of high-quality and low-quality responses. Contextual ranking, in turn, allows the model to refine its own response distribution using the contextual understanding ability of stronger LLMs. Furthermore, we apply probabilistic ranking and contextual ranking sequentially to the instruction-tuned LLM. The resulting model, which we call \textbf{Tuna}, consistently improves performance on Super Natural Instructions (119 test tasks), LMentry (25 test tasks), and Vicuna QA, and can even obtain better results than several strong reinforcement learning baselines. Our code and data are available at \url{https://github.com/microsoft/LMOps}.
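To make the idea of learning from ranked responses concrete, the following is a minimal sketch of a pairwise ranking objective. It is an illustration under stated assumptions, not necessarily the loss used for Tuna: the abstract does not specify the objective, so the length-normalized log-likelihood score, the hinge form, and the names `length_normalized_logprob`, `pairwise_ranking_loss`, and `margin` are all hypothetical choices made for this example.

```python
from typing import Sequence


def length_normalized_logprob(token_logprobs: Sequence[float]) -> float:
    """Average per-token log-probability the student assigns to one response.

    Assumption: responses are scored by length-normalized log-likelihood.
    """
    if not token_logprobs:
        raise ValueError("response must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def pairwise_ranking_loss(
    scores_best_first: Sequence[float], margin: float = 0.0
) -> float:
    """Hinge loss summed over all pairs (i, j) where i is ranked above j.

    `scores_best_first` holds the student's scores for the candidate
    responses, ordered from best to worst according to the teacher.
    A pair contributes to the loss only when the lower-ranked response
    scores within `margin` of, or above, the higher-ranked one.
    """
    loss = 0.0
    n = len(scores_best_first)
    for i in range(n):
        for j in range(i + 1, n):
            loss += max(0.0, scores_best_first[j] - scores_best_first[i] + margin)
    return loss
```

With this sketch, a student whose scores already agree with the teacher's ordering incurs zero loss, while each inverted pair is penalized in proportion to how far the worse response is scored above the better one. In practice such a term would be computed on model outputs in an autodiff framework and is often combined with a standard likelihood term; both of those details are assumptions here as well.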