Instruction tuning is crucial for aligning Large Language Models (LLMs), yet the quality of instruction-following data varies significantly. High-quality data is paramount but often scarce; conversely, abundant low-quality data is frequently discarded, leading to substantial information loss. Existing data augmentation methods struggle to exploit this low-quality data effectively, and the evaluation of such techniques remains poorly defined. To address this, we formally define the task of Instruction Distillation: distilling multiple low-quality, redundant inputs into high-quality, coherent instruction-output pairs. Specifically, we introduce a comprehensive data construction pipeline to create MIXTURE, a 144K-sample dataset pairing clusters of low-quality or semantically redundant instructions with their high-quality distillations. We then introduce LM-Mixup, a model first trained with supervised fine-tuning on MIXTURE and then optimized with reinforcement learning via Group Relative Policy Optimization (GRPO), using three complementary reward signals: quality, semantic alignment, and format compliance. We demonstrate that LM-Mixup effectively augments imperfect datasets: fine-tuning LLMs on its distilled data, which accounts for only about 3% of the entire dataset, not only surpasses full-dataset training but also competes with state-of-the-art high-quality data selection methods across multiple benchmarks. Our work establishes that low-quality data is a valuable resource when properly distilled and augmented with LM-Mixup, significantly enhancing the efficiency and performance of instruction-tuned LLMs.
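The reward design above can be made concrete with a minimal sketch. The abstract does not specify how the three signals are combined or how GRPO computes advantages, so the weighted sum, the function names, and the equal default weights below are all illustrative assumptions; only the group-relative normalization (score each sampled completion against its group's mean and standard deviation) reflects the core of GRPO.

```python
import statistics

def combined_reward(quality: float, alignment: float, format_ok: float,
                    w_q: float = 1.0, w_a: float = 1.0, w_f: float = 1.0) -> float:
    """Hypothetical scalarization of the three reward signals.

    The paper's actual combination scheme and weights are not stated here;
    an equally weighted sum is assumed purely for illustration.
    """
    return w_q * quality + w_a * alignment + w_f * format_ok

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: z-score each reward within its sampling group.

    This is the group-relative baseline at the heart of GRPO; a degenerate
    group (zero variance) is mapped to zero advantages via the std guard.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]
```

For example, a group of completions scoring `[1.0, 2.0, 3.0]` yields advantages that sum to zero and preserve the reward ordering, so only completions above their group's mean are reinforced.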