Dataset curation has become a cornerstone of strong large language model (LLM) performance. While various rule-based filtering heuristics exist for English and multilingual datasets, model-based filtering techniques have primarily focused on English. To address the disparity stemming from limited research on non-English languages, we develop a model-based filtering framework for multilingual datasets that aims to identify a diverse set of structured and knowledge-rich samples. Our approach emphasizes transparency, simplicity, and efficiency, leveraging Transformer- and FastText-based classifiers to ensure the broad accessibility of our technique and data. We conduct comprehensive ablation studies on the FineWeb-2 web crawl dataset across diverse language families, scripts, and levels of resource availability to demonstrate the effectiveness of our method. When training a 1B-parameter Llama model for 70B and 119B tokens, our approach matches the baseline MMLU score with as little as 15% of the training tokens, while also improving performance across other benchmarks and mitigating the curse of multilinguality. These findings provide strong evidence for the generalizability of our approach to other languages. Accordingly, we extend our framework to 20 languages and release the refined pretraining datasets.
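To make the model-based filtering idea concrete, the following is a minimal sketch of how a FastText-style quality classifier could be trained and applied to score web documents. It is illustrative only, not the paper's released classifiers: the training file `quality_train.txt`, the labels `__label__keep` / `__label__drop`, the hyperparameters, and the decision threshold are all assumptions introduced for this example.

```python
# Minimal sketch of a fastText-based quality classifier for filtering
# web documents, in the spirit of model-based dataset curation.
# File paths, label names, hyperparameters, and the threshold below are
# illustrative assumptions, not the paper's released artifacts.
import fasttext

# Training file in fastText supervised format, one document per line:
#   __label__keep  <document text>
#   __label__drop  <document text>
model = fasttext.train_supervised(
    input="quality_train.txt",  # hypothetical labeled file
    lr=0.1,
    epoch=5,
    wordNgrams=2,  # bigrams help capture local structure
    dim=100,
)

def keep_document(text: str, threshold: float = 0.5) -> bool:
    """Return True if the classifier scores the document above threshold."""
    # fastText's predict expects a single line, so strip newlines first.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__keep" and probs[0] >= threshold

# Example: score a candidate web document before adding it to the corpus.
sample = "Photosynthesis converts light energy into chemical energy ..."
print(keep_document(sample))
```

In practice, such a classifier would be run over each language's crawl shard independently, with the threshold tuned per language to balance retained token count against sample quality.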