We consider the following problem: given the weights of two models, can we test whether they were trained independently -- i.e., from independent random initializations? We consider two settings: constrained and unconstrained. In the constrained setting, we make assumptions about model architecture and training and propose a family of statistical tests that yield exact p-values with respect to the null hypothesis that the models were trained from independent random initializations. These p-values are valid regardless of the composition of either model's training data; we compute them by simulating exchangeable copies of each model under our assumptions and comparing various similarity measures of weights and activations between the original two models and these copies. We report the p-values from these tests on pairs of 21 open-weight models (210 pairs in total) and correctly identify all pairs of non-independent models. Our tests remain effective even if one model was fine-tuned for many tokens. In the unconstrained setting, where we make no assumptions about training procedures, allow for changes in model architecture, and account for adversarial evasion attacks, the previous tests no longer work. Instead, we propose a new test that matches hidden activations between two models and is robust both to adversarial transformations and to changes in model architecture. The test can also perform localized testing: identifying specific non-independent components of models. Though this test no longer yields exact p-values, empirically we find that its output behaves like one and reliably identifies non-independent models. Notably, we can use the test to identify specific parts of one model that are derived from another (e.g., how Llama 3.1-8B was pruned to initialize Llama 3.2-3B, or shared layers between Mistral-7B and StripedHyena-7B), and it is even robust to retraining individual layers of either model from scratch.
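The constrained-setting procedure can be sketched as a permutation test: permuting a network's hidden units leaves its function unchanged, so under the null hypothesis of independent random initialization the observed similarity between two models is exchangeable with the similarities computed against permuted copies, and the rank-based p-value is exact. The sketch below is a minimal illustration under simplifying assumptions (a single weight matrix per model and a row-aligned correlation statistic); both the statistic and the function names are stand-ins, not the paper's actual similarity measures.

```python
import numpy as np

rng = np.random.default_rng(0)

def aligned_similarity(w_a, w_b):
    # Mean absolute correlation between row i of model A's weight matrix
    # and row i of model B's. This is sensitive to whether the hidden
    # units line up, which is exactly what permuting them destroys.
    a = (w_a - w_a.mean(1, keepdims=True)) / w_a.std(1, keepdims=True)
    b = (w_b - w_b.mean(1, keepdims=True)) / w_b.std(1, keepdims=True)
    return np.abs((a * b).mean(1)).mean()

def independence_p_value(w_a, w_b, n_copies=99):
    # Exchangeable copies of model B are produced by permuting its hidden
    # units (rows). Under the null of independent initialization, the
    # observed similarity is exchangeable with the copies' similarities,
    # so the rank-based p-value below is exactly valid.
    obs = aligned_similarity(w_a, w_b)
    sims = [aligned_similarity(w_a, w_b[rng.permutation(w_b.shape[0])])
            for _ in range(n_copies)]
    return (1 + sum(s >= obs for s in sims)) / (1 + n_copies)
```

With 99 copies the smallest attainable p-value is 0.01, reached when the original pair is more similar than every permuted copy; a pair of genuinely independent matrices instead yields a p-value that is uniform over {0.01, 0.02, ..., 1.00}.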