Instruction tuning, a technique for improving large language model (LLM) performance with instruction datasets, depends heavily on the quality of the data used. Existing quality improvement methods alter instruction data through either dataset expansion or curation. However, expansion risks data redundancy, which can degrade LLM performance, while curation confines the LLM's potential to the original dataset. Our aim is to surpass the original data quality without incurring either shortcoming. To this end, we propose LIFT (LLM Instruction Fusion Transfer), a novel and versatile paradigm designed to raise instruction quality beyond that of the original data. LIFT strategically broadens the data distribution to encompass more high-quality subspaces and eliminates redundancy, concentrating on the high-quality segments across the overall data subspaces. Experimental results demonstrate that, even with a limited quantity of high-quality instruction data selected by our paradigm, LLMs not only consistently maintain robust performance across various tasks but also surpass some state-of-the-art results, highlighting the significant improvement in instruction quality achieved by our paradigm.