The pretraining data of today's strongest language models is opaque; in particular, little is known about the proportions of various domains or languages represented. In this work, we tackle a task that we call data mixture inference, which aims to uncover the distributional make-up of training data. We introduce a novel attack based on a previously overlooked source of information -- byte-pair encoding (BPE) tokenizers, used by the vast majority of modern language models. Our key insight is that the ordered list of merge rules learned by a BPE tokenizer naturally reveals information about the token frequencies in its training data: the first merge is the most common byte pair, the second is the most common pair after applying the first merge, and so on. Given a tokenizer's merge list along with data samples for each category of interest, we formulate a linear program that solves for the proportion of each category in the tokenizer's training set. Importantly, to the extent that tokenizer training data is representative of the pretraining data, we indirectly learn about pretraining data. In controlled experiments, we show that our attack recovers mixture ratios with high precision for tokenizers trained on known mixtures of natural languages, programming languages, and data sources. We then apply our approach to off-the-shelf tokenizers released with recent LMs. We confirm much of the publicly disclosed information about these models, and also make several new inferences: GPT-4o's tokenizer is much more multilingual than its predecessors, training on 39% non-English data; Llama3 extends GPT-3.5's tokenizer primarily for multilingual (48%) use; GPT-3.5's and Claude's tokenizers are trained predominantly on code (~60%). We hope our work sheds light on current design practices for pretraining data, and inspires continued research into data mixture inference for LMs.
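The key insight about merge order can be seen in a minimal BPE training sketch (illustrative only, not the paper's code): at every step the trainer merges the most frequent adjacent symbol pair, so a merge's rank in the list directly encodes how frequent that pair was in the training data at that step.

```python
# Minimal sketch of BPE training: the ordered merge list it returns
# records, at each step, the most frequent adjacent symbol pair.
from collections import Counter

def train_bpe(corpus, num_merges):
    """Return the ordered list of merge rules learned from `corpus`."""
    # Represent each word as a tuple of symbols (characters, for simplicity).
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair merges next
        merges.append(best)
        # Apply the chosen merge everywhere before counting again.
        merged = {}
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = merged
    return merges

merges = train_bpe("low low low lower lowest", 3)
# merges == [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

Because the merge list is ordered by frequency at training time, an observer who replays these steps on candidate corpora can infer which mixture of corpora would have produced the observed ordering.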
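The linear-programming step can be sketched as follows. This is a simplified, hypothetical formulation (the paper's actual constraints are derived from replaying the merge list): each observed merge implies that, under the true mixture, the chosen pair was at least as frequent as every competitor pair, which yields linear constraints on the mixture weights; nonnegative slack variables absorb constraints that cannot all hold exactly.

```python
# Hedged sketch of solving for mixture proportions with a linear program.
# D[k, i] = (freq of the chosen pair) - (freq of a competitor pair) for
# ordering constraint k, as measured on data from category i. Toy numbers.
import numpy as np
from scipy.optimize import linprog

D = np.array([
    [ 3.0, -1.0],   # constraint satisfied only if category 0 has weight
    [ 2.0, -0.5],
    [-1.0,  4.0],   # constraint satisfied only if category 1 has weight
])
K, C = D.shape

# Variables x = [alpha_0..alpha_{C-1}, v_0..v_{K-1}]; v_k is slack for
# constraint k. Minimize total slack subject to:
#   D @ alpha + v >= 0, sum(alpha) = 1, alpha >= 0, v >= 0.
c = np.concatenate([np.zeros(C), np.ones(K)])
A_ub = np.hstack([-D, -np.eye(K)])   # -(D @ alpha) - v <= 0
b_ub = np.zeros(K)
A_eq = np.concatenate([np.ones(C), np.zeros(K)])[None, :]
b_eq = np.array([1.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * (C + K))
alpha = res.x[:C]   # recovered mixture proportions
```

With these toy frequency gaps, a zero-slack solution exists, so the solver returns mixture weights on the simplex that are consistent with every ordering constraint; on real merge lists, the weights minimizing total slack are the inferred data mixture.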