The pretraining data of today's strongest language models is opaque; in particular, little is known about the proportions of various domains or languages represented. In this work, we tackle a task which we call data mixture inference, which aims to uncover the distributional make-up of training data. We introduce a novel attack based on a previously overlooked source of information: byte-pair encoding (BPE) tokenizers, used by the vast majority of modern language models. Our key insight is that the ordered list of merge rules learned by a BPE tokenizer naturally reveals information about the token frequencies in its training data. Given a tokenizer's merge list along with example data for each category of interest, we formulate a linear program that solves for the proportion of each category in the tokenizer's training set. In controlled experiments, we show that our attack recovers mixture ratios with high precision for tokenizers trained on known mixtures of natural languages, programming languages, and data sources. We then apply our approach to off-the-shelf tokenizers released with recent LMs. We confirm much publicly disclosed information about these models, and also make several new inferences: GPT-4o and Mistral NeMo's tokenizers are much more multilingual than their predecessors, training on 39% and 47% non-English language data, respectively; Llama 3 extends GPT-3.5's tokenizer primarily for multilingual (48%) use; GPT-3.5's and Claude's tokenizers are trained on predominantly code (~60%). We hope our work sheds light on current design practices for pretraining data, and inspires continued research into data mixture inference for LMs.
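The core insight above can be illustrated with a toy sketch. BPE always merges the most frequent remaining pair, so each observed merge yields a linear constraint on the unknown mixture weights: under the true mixture, the chosen pair's weighted frequency must exceed that of every pair merged later. The pair counts, category names, and merge list below are hypothetical, and the sketch ignores the fact that real BPE updates pair counts after every merge; it also stands in a grid search over the simplex for the paper's LP solver, to stay dependency-free.

```python
# Toy sketch of data mixture inference from a BPE merge list.
# All counts, category names, and merges here are hypothetical.

# Hypothetical byte-pair counts measured on sample data from each category.
pair_counts = {
    "english": {"th": 50, "he": 40, "in": 30},
    "code":    {"th": 5,  "he": 10, "in": 60},
}
# Ordered merge list read off the target tokenizer.
observed_merges = ["in", "th", "he"]

def margin(alpha):
    """Smallest slack over all 'chosen pair beats later pair' constraints.

    For merge step t with chosen pair p, every pair q merged later must
    satisfy sum_i alpha_i * (count_i(p) - count_i(q)) >= 0, because BPE
    merges the most frequent pair under the (unknown) mixture.
    (Real BPE re-counts pairs after each merge; this sketch does not.)
    """
    cats = list(pair_counts)
    m = float("inf")
    for t, p in enumerate(observed_merges):
        for q in observed_merges[t + 1:]:
            slack = sum(a * (pair_counts[c][p] - pair_counts[c][q])
                        for a, c in zip(alpha, cats))
            m = min(m, slack)
    return m

# The paper solves a linear program; a grid search over the 1-simplex
# stands in for the LP solver here.
best = max(((a / 1000, 1 - a / 1000) for a in range(1001)),
           key=margin)
print(f"mixture consistent with merge order: "
      f"english={best[0]:.3f}, code={best[1]:.3f}")
```

Any mixture with a non-negative margin is consistent with the observed merge order; with only three merges that feasible set is wide, and it shrinks as more merges (hence more constraints) are added, which is why real merge lists with tens of thousands of rules pin the proportions down precisely.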