Protecting the copyright of large language models (LLMs) has become crucial because of their resource-intensive training and the carefully designed licenses that accompany them. However, identifying the original base model of an LLM is challenging because its parameters may have been altered. In this study, we introduce a human-readable fingerprint for LLMs that uniquely identifies the base model without exposing model parameters or interfering with training. We first observe that the vector direction of an LLM's parameters remains stable once the model has converged during pretraining, showing negligible perturbation through subsequent training steps, including continued pretraining, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF); a matching direction is therefore a sufficient condition for identifying the base model. We validate necessity by continuing to train an LLM with an extra term that drives the parameter direction away from the original, and find that the model becomes damaged. However, this direction is vulnerable to simple attacks such as dimension permutation or matrix rotation, which change it significantly without affecting performance. To address this, we leverage the Transformer structure to systematically analyze potential attacks and define three invariant terms that identify an LLM's base model. We make these invariant terms human-readable by mapping them to a Gaussian vector with a convolutional encoder and then converting that vector into a natural image with StyleGAN2. Our method generates a dog image as an identity fingerprint for an LLM, where the dog's appearance strongly indicates the LLM's base model. The fingerprint provides intuitive information for qualitative discrimination, while the invariant terms can be used for precise quantitative verification. Experimental results across various LLMs demonstrate the effectiveness of our method.
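The contrast between a stable parameter direction and a rotation attack can be illustrated with a small numerical sketch. Everything here is an assumption made for illustration: the weights are random matrices rather than real LLM parameters, fine-tuning is simulated as small additive noise, and the product `W_q @ W_k.T` is used as one example of a quantity that an orthogonal transformation leaves unchanged. It is not claimed to be one of the three invariant terms defined in this work.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two arrays, treated as flat vectors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
d = 64  # hypothetical hidden size

# Stand-ins for the query and key projection matrices of one attention layer.
W_q = rng.standard_normal((d, d))
W_k = rng.standard_normal((d, d))

# Simulated fine-tuning (assumption): a small perturbation of the weights.
W_q_ft = W_q + 0.01 * rng.standard_normal((d, d))
sim_finetuned = cosine(W_q, W_q_ft)

# Rotation attack: multiply both projections by the same orthogonal matrix Q.
# Attention scores (x W_q)(x W_k)^T are unchanged because Q Q^T = I.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
W_q_rot, W_k_rot = W_q @ Q, W_k @ Q
sim_rotated = cosine(W_q, W_q_rot)

# An illustrative invariant: the product W_q W_k^T cancels Q entirely.
inv_before = W_q @ W_k.T
inv_after = W_q_rot @ W_k_rot.T
sim_invariant = cosine(inv_before, inv_after)

print(f"direction after simulated fine-tuning: {sim_finetuned:.4f}")
print(f"direction after rotation attack:       {sim_rotated:.4f}")
print(f"invariant product after the attack:    {sim_invariant:.4f}")
```

In this toy setting the raw direction survives the small perturbation but is scrambled by the rotation, while the product is preserved up to floating-point error. This is the kind of behavior that motivates comparing invariant terms rather than raw parameters.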