Instruction tuning has become a foundation for unlocking the capabilities of large-scale pretrained models and improving their performance on complex tasks. Constructing high-quality instruction datasets is therefore crucial for enhancing model performance and generalizability. Although current instruction datasets have reached tens of millions of samples, models finetuned on them may still struggle with complex instruction following and with tasks in rare domains. This is primarily due to limited expansion in both the ``coverage'' (the range of task types and knowledge areas) and the ``depth'' (instruction complexity) of the instruction set. To address this issue, we propose a systematic instruction data construction framework that integrates a hierarchical tagging system, an informative seed selection algorithm, an evolutionary data synthesis process, and model deficiency diagnosis with targeted data generation. These components form an iterative closed loop that continuously enhances the coverage and depth of instruction data. Based on this framework, we construct Infinity Instruct Subject, a high-quality dataset containing $\sim$1.5 million instructions. Experiments on multiple foundation models and benchmark tasks demonstrate its effectiveness in improving instruction-following capabilities. Further analyses suggest that Infinity Instruct Subject has greater coverage and depth than comparable synthesized instruction datasets. Our work lays a theoretical and practical foundation for the efficient, continuous evolution of instruction datasets, shifting the focus from expanding data quantity to improving data quality.