Recent advances in large language models (LLMs) have been achieved through a combination of instruction tuning and human alignment. However, building manually crafted instruction datasets and performing human alignment have become the bottleneck for scaling the development of LLMs. In this paper, we explore the idea of using AI models in lieu of humans as the teacher to train student LLMs. Our method is inspired by how human students refine their writing skills by following rubrics and learning from the revisions offered by their tutors. Specifically, we employ a teacher LLM to create a curriculum for instruction tuning of the student LLM, an approach we call Curriculum Instruction TunING (CITING). It comprises two main steps: (1) the teacher LLM crafts rubrics for evaluating the answers to various types of questions, and (2) the student LLM learns to follow the rubrics and to perform self-correction from the revisions made by the teacher. We carry out these two steps iteratively to form the full CITING procedure. We compare CITING to a series of state-of-the-art baselines on four datasets. Under GPT-4 evaluation, our method demonstrates strong improvements in articulateness, depth, and comprehensiveness. Specifically, it achieves average winning rates of 79.4% over SFT, 73.4% over RLHF, 78.1% over RRHF, and 76.3% over RAFT.
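The two-step loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify how rubrics are keyed, how revisions are produced, or how the student is updated, so the names `Example`, `RevisionRecord`, `build_rubrics`, `citing_round`, and `run_citing`, the per-category rubric lookup, and the four callables (`write_rubric`, `student_answer`, `teacher_revise`, `fine_tune`) are all hypothetical placeholders for real model calls and a real training step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# All names below are hypothetical; they stand in for real LLM calls.
Student = Callable[[str], str]                    # instruction -> answer
RubricWriter = Callable[[str], str]               # question category -> rubric
Reviser = Callable[[str, str, str], str]          # (instruction, rubric, draft) -> revision


@dataclass
class Example:
    instruction: str
    category: str  # assumed question type used to select a rubric


@dataclass
class RevisionRecord:
    instruction: str
    rubric: str
    initial_answer: str
    revised_answer: str


def build_rubrics(categories: List[str], write_rubric: RubricWriter) -> Dict[str, str]:
    """Step 1: the teacher writes one rubric per question category."""
    return {category: write_rubric(category) for category in categories}


def citing_round(
    examples: List[Example],
    rubrics: Dict[str, str],
    student_answer: Student,
    teacher_revise: Reviser,
) -> List[RevisionRecord]:
    """Step 2 data collection: student drafts, teacher revises against the rubric."""
    records = []
    for ex in examples:
        rubric = rubrics[ex.category]
        draft = student_answer(ex.instruction)
        revised = teacher_revise(ex.instruction, rubric, draft)
        records.append(RevisionRecord(ex.instruction, rubric, draft, revised))
    return records


def run_citing(
    examples: List[Example],
    write_rubric: RubricWriter,
    student_answer: Student,
    teacher_revise: Reviser,
    fine_tune: Callable[[Student, List[RevisionRecord]], Student],
    rounds: int = 3,  # assumed; the abstract does not state the number of iterations
) -> Tuple[Student, List[List[RevisionRecord]]]:
    """Iterate draft -> revise -> fine-tune, returning the final student and all records."""
    categories = sorted({ex.category for ex in examples})
    rubrics = build_rubrics(categories, write_rubric)
    history = []
    for _ in range(rounds):
        records = citing_round(examples, rubrics, student_answer, teacher_revise)
        # The student is updated on (instruction, rubric, draft, revision) tuples.
        student_answer = fine_tune(student_answer, records)
        history.append(records)
    return student_answer, history
```

Passing the models and the training step in as callables keeps the sketch independent of any particular LLM library; in practice each callable would wrap a prompt to the teacher or student model, and `fine_tune` would wrap a supervised training run on the collected revisions.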