In instruction fine-tuning of large language models (LLMs), it is widely recognized that a small set of high-quality instructions is superior to a large number of low-quality ones. Many instruction selection methods have been proposed, but most select instructions based on heuristic quality metrics and consider data selection only before training begins. These designs leave instruction fine-tuning under-optimized, and fixed heuristic metrics are often hard to tailor to specific tasks. We therefore propose RAISE (Reinforced Adaptive Instruction SElection), a dynamic, task-objective-driven instruction selection framework that brings the entire instruction fine-tuning process into the optimization loop, selecting instructions at each step according to their expected impact on model performance. Our approach is highly interpretable and offers strong task-specific optimization capability. By modeling dynamic instruction selection as a sequential decision-making process, we train the selection policy with reinforcement learning (RL). Extensive experiments and analysis demonstrate the superiority of our method over other instruction selection methods. Notably, RAISE surpasses full-data training while updating only 1% of the training steps, demonstrating its efficiency and effectiveness.
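To make the "sequential decision-making" framing concrete, the loop below is a minimal, hypothetical sketch: a softmax policy over simple per-instruction features samples which instruction to train on at each step, and a REINFORCE-style update reinforces choices that yield a higher reward. The feature vectors, the linear policy, and the reward proxy are all illustrative assumptions, not the paper's actual implementation.

```python
import math
import random

def softmax(scores):
    # Numerically stable softmax over policy scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def train_selector(pool, steps=200, lr=0.5, seed=0):
    """Sketch of RL-trained instruction selection.

    pool: list of per-instruction feature vectors (illustrative).
    Returns the learned linear policy weights.
    """
    rng = random.Random(seed)
    dim = len(pool[0])
    w = [0.0] * dim
    for _ in range(steps):
        # Policy: softmax over linear scores of each candidate instruction.
        scores = [sum(wi * fi for wi, fi in zip(w, feats)) for feats in pool]
        probs = softmax(scores)
        idx = rng.choices(range(len(pool)), weights=probs)[0]
        # Stand-in reward: in RAISE this would be the expected impact on
        # model performance; here the first feature proxies that gain.
        reward = pool[idx][0]
        # REINFORCE gradient of log pi(idx) for a softmax over linear scores:
        # feature(idx) minus the probability-weighted average feature.
        avg = [sum(p * f[i] for p, f in zip(probs, pool)) for i in range(dim)]
        for i in range(dim):
            w[i] += lr * reward * (pool[idx][i] - avg[i])
    return w
```

Under this toy reward, the policy learns to prefer the instruction whose features predict a larger gain, which mirrors the paper's idea of selecting instructions by expected performance impact at each training step.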