Scaling data quantity is essential for large language models (LLMs), yet recent findings show that data quality can significantly boost performance and training efficiency. We introduce a German-language dataset curation pipeline that combines heuristic and model-based filtering techniques with synthetic data generation. We use our pipeline to create Aleph-Alpha-GermanWeb, a 628B-word German pre-training dataset composed of three subsets drawing from: (1) Common Crawl web data (organic subset; 78B words), (2) FineWeb2 (organic subset; 235B words), and (3) synthetically generated data conditioned on actual, organic web data (synthetic subset; 329B words). We evaluate our dataset by pre-training both a 1B Llama-style model and an 8B tokeniser-free hierarchical autoregressive transformer (HAT) from scratch. On German-language benchmarks, including MMMLU, models trained on Aleph-Alpha-GermanWeb show significant performance gains over models trained on FineWeb2 alone. This advantage holds at the 8B scale even when FineWeb2 is enriched with human-curated high-quality data sources such as Wikipedia. Our findings support the growing body of evidence that model-based data curation and synthetic data generation can significantly enhance LLM pre-training datasets.