Large Language Models (LLMs) have demonstrated remarkable abilities in general scenarios, and instruction finetuning enables them to align with humans across a variety of tasks. Nevertheless, the diversity and quality of instruction data remain two main challenges for instruction finetuning. To address them, we propose a novel gradient-based method that automatically selects high-quality and diverse instruction finetuning data for machine translation. Our key innovation is to analyze how individual training examples influence the model during training. Specifically, we use influence functions together with a small high-quality seed dataset to identify training examples that exert a beneficial influence on the model, and we treat these as high-quality examples. Moreover, to enhance the diversity of the training data, we maximize the variety of influences the examples have on the model by clustering their gradients and resampling. Extensive experiments on WMT22 and FLORES translation tasks demonstrate the superiority of our method, and in-depth analysis further validates its effectiveness and generalization.
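The two selection steps described above can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes per-example gradients are already available as rows of a NumPy array, and it replaces the full influence function with a first-order proxy (the dot product between a training gradient and the mean seed gradient, with the inverse-Hessian term dropped). The function names `influence_scores`, `kmeans`, and `select_examples`, the positive-score threshold, and the round-robin resampling rule are all hypothetical choices made for illustration.

```python
import numpy as np


def influence_scores(train_grads, seed_grads):
    """First-order influence proxy (assumption: inverse-Hessian term dropped).

    A positive score means a gradient step on the training example is
    expected to lower the average loss on the seed set.
    """
    return train_grads @ seed_grads.mean(axis=0)


def kmeans(x, k, iters=20, seed=0):
    """Plain k-means on the rows of x; returns one cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels


def select_examples(train_grads, seed_grads, n_select, n_clusters=4, seed=0):
    """Quality filter by influence score, then diversity via gradient clusters."""
    scores = influence_scores(train_grads, seed_grads)

    # Quality: keep only examples whose estimated influence is beneficial.
    candidates = np.flatnonzero(scores > 0)
    if len(candidates) == 0:
        return []

    # Diversity: cluster unit-normalized gradients so that clusters reflect
    # the direction of influence rather than its magnitude.
    grads = train_grads[candidates]
    grads = grads / (np.linalg.norm(grads, axis=1, keepdims=True) + 1e-12)
    k = min(n_clusters, len(candidates))
    labels = kmeans(grads, k, seed=seed)

    # Resample round-robin across clusters, best-scored example first.
    buckets = [candidates[labels == j] for j in range(k)]
    buckets = [b[np.argsort(-scores[b])] for b in buckets]
    chosen, rank = [], 0
    while len(chosen) < n_select and any(rank < len(b) for b in buckets):
        for b in buckets:
            if rank < len(b) and len(chosen) < n_select:
                chosen.append(int(b[rank]))
        rank += 1
    return chosen
```

In practice the gradient rows would come from the finetuned model, likely after projection to a lower dimension, and the seed gradients from the small high-quality seed dataset. Both are taken as given inputs here.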