Instruction tuning is a standard technique employed to align large language models to end tasks and user preferences after the initial pretraining phase. Recent research indicates the critical role of data engineering in instruction tuning -- when appropriately selected, only limited data is necessary to achieve superior performance. However, we still lack a principled understanding of what makes good instruction tuning data for alignment, and of how to select such data automatically and effectively. In this work, we delve deeply into automatic data selection strategies for alignment. We start with controlled studies to measure data along three dimensions: complexity, quality, and diversity, along which we examine existing methods and introduce novel techniques for enhanced data measurement. Subsequently, we propose a simple strategy to select data samples based on these measurements. We present deita (short for Data-Efficient Instruction Tuning for Alignment), a series of models fine-tuned from LLaMA and Mistral models using data samples automatically selected with our proposed approach. Empirically, deita performs better than or on par with the state-of-the-art open-source alignment models with only 6K SFT training data samples -- over 10x fewer than those used by the baselines. When further trained with direct preference optimization (DPO), deita-Mistral-7B + DPO, trained with 6K SFT and 10K DPO samples, achieves an MT-Bench score of 7.55 and an AlpacaEval score of 90.06%. We anticipate that this work will provide tools for automatic data selection, facilitating data-efficient alignment. We release our models as well as the selected datasets so that future research can align models more efficiently.