Dataset curation has become a cornerstone of strong large language model (LLM) performance. While various rule-based filtering heuristics exist for English and multilingual datasets, model-based filtering techniques have primarily focused on English. To address the disparity stemming from limited research on non-English languages, we develop a model-based filtering framework for multilingual datasets that aims to identify a diverse set of structured and knowledge-rich samples. Our approach emphasizes transparency, simplicity, and efficiency, leveraging Transformer- and FastText-based classifiers to ensure the broad accessibility of our technique and data. We conduct comprehensive ablation studies on the FineWeb-2 web crawl dataset across diverse language families, scripts, and levels of resource availability to demonstrate the effectiveness of our method. When training a 1B-parameter Llama model for 70B and 119B tokens, our approach matches the baseline MMLU score with as little as 15% of the training tokens, while also improving performance across other benchmarks and mitigating the curse of multilinguality. These findings provide strong evidence for the generalizability of our approach to other languages. Accordingly, we extend our framework to 20 languages and release the refined pretraining datasets.
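To make the model-based filtering idea concrete, the following is a minimal sketch of how a FastText-style quality classifier could be trained and applied to score web documents. It is illustrative only, not the paper's released classifiers: the training file `quality_train.txt`, the labels `__label__keep` / `__label__drop`, the hyperparameters, and the decision threshold are all assumptions introduced for this example.

```python
# Minimal sketch of a fastText-based quality classifier for filtering
# web documents, in the spirit of model-based dataset curation.
# File paths, label names, hyperparameters, and the threshold below are
# illustrative assumptions, not the paper's released artifacts.
import fasttext

# Training file in fastText supervised format, one document per line:
#   __label__keep  <document text>
#   __label__drop  <document text>
model = fasttext.train_supervised(
    input="quality_train.txt",  # hypothetical labeled file
    lr=0.1,
    epoch=5,
    wordNgrams=2,  # bigrams help capture local structure
    dim=100,
)

def keep_document(text: str, threshold: float = 0.5) -> bool:
    """Return True if the classifier scores the document above threshold."""
    # fastText's predict expects a single line, so strip newlines first.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__keep" and probs[0] >= threshold

# Example: score a candidate web document before adding it to the corpus.
sample = "Photosynthesis converts light energy into chemical energy ..."
print(keep_document(sample))
```

In practice, such a classifier would be run over each language's crawl shard independently, with the threshold tuned per language to balance retained token count against sample quality.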