Instruction tuning is a standard technique for aligning large language models with end tasks and user preferences after the initial pretraining phase. Recent research highlights the critical role of data engineering in instruction tuning: when appropriately selected, only a limited amount of data is necessary to achieve superior performance. However, we still lack a principled understanding of what makes good instruction tuning data for alignment, and of how to select such data automatically and effectively. In this work, we delve deeply into automatic data selection strategies for alignment. We start with controlled studies that measure data along three dimensions: complexity, quality, and diversity, examining existing methods and introducing novel techniques for enhanced data measurement. We then propose a simple strategy to select data samples based on these measurements. We present deita (short for Data-Efficient Instruction Tuning for Alignment), a series of models fine-tuned from LLaMA and Mistral models on data samples automatically selected with our proposed approach. Empirically, deita performs on par with or better than state-of-the-art open-source alignment models with only 6K SFT training samples -- over 10x less than the data used by the baselines. When further trained with direct preference optimization (DPO), deita-Mistral-7B + DPO, trained with 6K SFT and 10K DPO samples, achieves a 7.55 MT-Bench score and a 90.06% AlpacaEval score. We anticipate that this work will provide tools for automatic data selection, facilitating data-efficient alignment. We release our models as well as the selected datasets so that future research can align models more effectively and efficiently.
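As a concrete illustration of the kind of selection strategy described above, the sketch below greedily keeps the highest-scoring samples (using a score that combines complexity and quality) while rejecting candidates that are too similar to already-selected ones, so the final pool stays diverse. This is a minimal sketch under assumptions: the combined score, the similarity threshold, and the greedy procedure are illustrative choices, not the paper's exact algorithm.

```python
import numpy as np

def select_data(samples, scores, embeddings, budget=6_000, sim_threshold=0.9):
    """Illustrative score-first, diversity-aware selection (assumed sketch).

    samples:    list of instruction-response examples
    scores:     per-sample scalar, e.g. complexity_score * quality_score (assumption)
    embeddings: unit-normalized sample embeddings, shape (N, d)
    budget:     number of samples to keep (e.g., 6K for SFT)
    """
    order = np.argsort(scores)[::-1]  # visit samples from highest to lowest score
    selected, selected_emb = [], []
    for idx in order:
        if len(selected) >= budget:
            break
        emb = embeddings[idx]
        # Skip candidates that are near-duplicates of anything already selected.
        if selected_emb and max(float(emb @ e) for e in selected_emb) > sim_threshold:
            continue
        selected.append(samples[idx])
        selected_emb.append(emb)
    return selected
```

In this sketch, filtering by maximum cosine similarity to the selected pool is one simple way to operationalize the diversity dimension; the complexity and quality dimensions enter only through the precomputed per-sample score.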