Data quality determines foundation model performance, yet systematic data-processing frameworks are lacking. We introduce Data Darwinism, a ten-level taxonomy (L0-L9) that conceptualizes data-model co-evolution: advanced models produce superior data for training next-generation systems. We validate this taxonomy on scientific literature by constructing Darwin-Science, a 900B-token corpus spanning levels L0-L5. We identify a learnability gap in raw scientific text, which we bridge via L4 (Generative Refinement) and L5 (Cognitive Completion), using frontier LLMs to explicate implicit reasoning and terminology. To ensure rigorous attribution, we pre-train daVinci-origin-3B/7B models from scratch, excluding scientific content to create contamination-free baselines. After 600B tokens of continued pre-training, Darwin-Science outperforms these baselines by +2.12 (3B) and +2.95 (7B) points across 20+ benchmarks, rising to +5.60 and +8.40 points on domain-aligned tasks. Systematic progression from lower levels to L5 yields a +1.36-point cumulative gain, confirming that higher-level processing unlocks latent data value. We release the Darwin-Science corpus and daVinci-origin models to enable principled, co-evolutionary development.