We present MixtureVitae, an open-access pretraining corpus built to minimize legal risk while delivering strong downstream performance. MixtureVitae follows a permissive-first, risk-mitigated sourcing strategy that combines public-domain and permissively licensed text (e.g., CC-BY, Apache) with carefully justified low-risk additions (e.g., government works and EU TDM-eligible sources). It adopts a simple, single-stage pretraining recipe that integrates a large proportion of permissively licensed synthetic instruction and reasoning data: signals typically introduced during post-training and generally scarce in permissive web corpora. We categorize all sources into a three-tier scheme reflecting varying risk levels and provide shard-level provenance metadata to enable risk-aware usage. In controlled experiments using the open-sci-ref training protocol (fixed architectures and hyperparameters; 50B and 300B token budgets at 130M-1.7B parameters), models trained on MixtureVitae consistently outperform other permissive datasets across a suite of standard benchmarks, and at the 1.7B-parameter/300B-token setting they surpass FineWeb-Edu and approach DCLM late in training. Performance is particularly strong on MMLU and on math and code benchmarks: a 1.7B model pretrained on 300B MixtureVitae tokens matches or exceeds a strong 1.7B instruction-tuned baseline on GSM8K, HumanEval, and MBPP, despite using over 36 times fewer tokens (300B vs. ~11T). Supported by a thorough decontamination analysis, these results show that permissive-first data with high instruction and reasoning density, tiered by licensing and provenance-related risk, can provide a practical, risk-mitigated foundation for training capable LLMs, reducing reliance on broad web scrapes without sacrificing competitiveness. Code: https://github.com/ontocord/mixturevitae
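To illustrate what risk-aware usage of shard-level provenance metadata could look like in practice, here is a minimal sketch in Python. The field names (`tier`, `license`, `path`) and the metadata layout are illustrative assumptions for this example, not the dataset's actual schema; consult the repository linked above for the real format.

```python
# Hypothetical sketch: selecting pretraining shards by risk tier using
# shard-level provenance metadata of the kind MixtureVitae describes.
# The schema below (tier/license/path fields) is an assumption made for
# illustration only.

shards = [
    {"path": "shard-0001.jsonl", "tier": 1, "license": "CC-BY"},
    {"path": "shard-0002.jsonl", "tier": 2, "license": "gov-work"},
    {"path": "shard-0003.jsonl", "tier": 3, "license": "tdm-eligible"},
]

def select_shards(shards, max_tier):
    """Keep only shards at or below the caller's risk tolerance.

    Lower tier numbers denote lower provenance/licensing risk, so a
    risk-averse user passes a small max_tier.
    """
    return [s["path"] for s in shards if s["tier"] <= max_tier]

# A user willing to accept tiers 1-2 but not tier 3:
print(select_shards(shards, max_tier=2))
```

The point of tiered metadata is exactly this kind of downstream choice: a single corpus can serve users with different risk tolerances without re-curation.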