Data Science and Technology Towards AGI Part I: Tiered Data Management

Yudong Wang, Zixuan Fu, Hengyu Zhao, Chen Zhao, Chuyue Zhou, Xinle Lin, Hongya Lyu, Shuaikang Xue, Yi Yi, Yingjiao Wang, Zhi Zheng, Yuzhou Zhang, Jie Zhou, Chaojun Xiao, Xu Han, Zhiyuan Liu, Maosong Sun

From arXiv; 16 pages, 3 figures, 7 tables

The development of artificial intelligence can be viewed as an evolution of data-driven learning paradigms, in which successive shifts in how data is organized and used have driven advances in model capability. Current LLM research is dominated by a paradigm that relies heavily on unidirectional scaling of data size, and it increasingly runs into bottlenecks in data availability, acquisition cost, and training efficiency. In this work, we argue that the development of AGI is entering a new phase of data-model co-evolution, in which models actively guide data management while high-quality data, in turn, amplifies model capabilities. To implement this vision, we propose a tiered data management framework with five tiers, L0 through L4, ranging from raw uncurated resources to organized and verifiable knowledge. The framework is designed to support the full LLM training lifecycle across heterogeneous learning objectives and cost constraints. Importantly, LLMs are used throughout the data management process, for example in quality scoring and content editing, to refine data across tiers. Each tier is characterized by distinct data properties, management strategies, and training roles, so that data can be strategically allocated across LLM training stages, including pre-training, mid-training, and alignment. The framework balances data quality, acquisition cost, and marginal training benefit, providing a systematic approach to scalable and sustainable data management. We validate the framework through empirical studies in which tiered datasets are constructed from raw corpora and used across multiple training phases. Experimental results demonstrate that tier-aware data utilization significantly improves training efficiency and model performance. To facilitate further research, we release our tiered datasets and processing tools to the community.
