Scaling Towards the Information Boundary of Instruction Sets: The Infinity Instruct Subject Technical Report

Instruction tuning has become a foundation for unlocking the capabilities of large-scale pretrained models and improving their performance on complex tasks. Constructing high-quality instruction datasets is therefore crucial for enhancing model performance and generalizability. Although current instruction datasets have reached tens of millions of samples, models finetuned on them may still struggle with complex instruction following and with tasks in rare domains. This is primarily due to limited expansion in both the "coverage" (of task types and knowledge areas) and the "depth" (instruction complexity) of the instruction set. To address this issue, we propose a systematic instruction data construction framework that integrates a hierarchical tagging system, an informative seed selection algorithm, an evolutionary data synthesis process, and model deficiency diagnosis with targeted data generation. These components form an iterative closed loop that continuously enhances the coverage and depth of instruction data. Based on this framework, we construct Infinity Instruct Subject, a high-quality dataset containing approximately 1.5 million instructions. Experiments on multiple foundation models and benchmark tasks demonstrate its effectiveness in improving instruction-following capabilities. Further analyses suggest that Infinity Instruct Subject shows enlarged coverage and depth compared to comparable synthesized instruction datasets. Our work lays a theoretical and practical foundation for the efficient, continuous evolution of instruction datasets, moving from data quantity expansion to qualitative improvement.

