The pretraining data of today's strongest language models is opaque. In particular, little is known about the proportions of various domains or languages represented. In this work, we tackle a task we call data mixture inference, which aims to uncover the distributional make-up of training data. We introduce a novel attack based on a previously overlooked source of information -- byte-pair encoding (BPE) tokenizers, used by the vast majority of modern language models. Our key insight is that the ordered list of merge rules learned by a BPE tokenizer naturally reveals information about the token frequencies in its training data: the first merge is the most common byte pair, the second is the most common pair after merging the first token, and so on. Given a tokenizer's merge list along with data samples for each category of interest, we formulate a linear program that solves for the proportion of each category in the tokenizer's training set. Importantly, to the extent that tokenizer training data is representative of the pretraining data, we indirectly learn about the pretraining data. In controlled experiments, we show that our attack recovers mixture ratios with high precision for tokenizers trained on known mixtures of natural languages, programming languages, and data sources. We then apply our approach to off-the-shelf tokenizers released with recent LMs. We confirm much publicly disclosed information about these models, and also make several new inferences: GPT-4o's tokenizer is much more multilingual than its predecessors, training on 39% non-English data; Llama3 extends GPT-3.5's tokenizer primarily for multilingual (48%) use; GPT-3.5's and Claude's tokenizers are trained predominantly on code (~60%). We hope our work sheds light on current design practices for pretraining data, and inspires continued research into data mixture inference for LMs.
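The key insight can be made concrete with a toy BPE trainer. This is a minimal character-level sketch (real BPE tokenizers operate on bytes and use optimized implementations); the function name `train_bpe` and the example corpus are illustrative, not from the paper. It shows that the ordered merge list directly encodes pair-frequency information about the training data: the most frequent pair in the corpus becomes the first merge.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE trainer: each merge rule is the currently most frequent
    adjacent token pair, so the ordered merge list reveals frequency
    information about the training corpus."""
    # Represent each word as a list of single characters (stand-ins for bytes).
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        # Count all adjacent token pairs across the corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair becomes the next merge
        merges.append(best)
        # Apply the merge everywhere before choosing the next one.
        merged = best[0] + best[1]
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges

# A corpus dominated by runs of "a": the pair ('a', 'a') is most frequent,
# so it must appear as the first learned merge.
corpus = ["aaaa"] * 10 + ["bc"] * 3
merges = train_bpe(corpus, 3)
print(merges)  # [('a', 'a'), ('aa', 'aa'), ('b', 'c')]
```

Reading this in reverse is the attack's premise: an observer who sees only the ordered merge list can infer which pairs (and hence which languages or domains producing those pairs) dominated the training data.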