Suppose Alice trains an open-weight language model and Bob uses a black-box derivative of Alice's model to produce text. Can Alice prove that Bob is using her model, either by querying Bob's derivative model (query setting) or from the text alone (observational setting)? We formulate this question as an independence testing problem, in which the null hypothesis is that Bob's model or text is independent of Alice's randomized training run, and investigate it through the lens of palimpsestic memorization in language models: models are more likely to memorize data seen later in training, so we can test whether Bob is using Alice's model with test statistics that capture correlation between Bob's model or text and the ordering of training examples in Alice's run. If Alice has randomly shuffled her training data, then any significant correlation amounts to exactly quantifiable statistical evidence against the null hypothesis, regardless of the composition of Alice's training data. In the query setting, we directly estimate (via prompting) the likelihood Bob's model assigns to Alice's training examples and their order; correlating the likelihoods of over 40 fine-tunes of various Pythia and OLMo base models (1B to 12B parameters) with each base model's training data order yields a p-value of at most 1e-8 in all but six cases. In the observational setting, we try two approaches: 1) estimating the likelihood that Bob's text overlaps with spans of Alice's training examples, and 2) estimating the likelihood of Bob's text under different versions of Alice's model obtained by repeating the last phase (e.g., 1%) of her training run on reshuffled data. The second approach can reliably identify Bob's text from as little as a few hundred tokens; the first involves no retraining but requires far more tokens (several hundred thousand) to achieve high power.