HuRef: HUman-REadable Fingerprint for Large Language Models

Protecting the copyright of large language models (LLMs) has become crucial because their training is resource-intensive and their licenses are carefully designed. However, identifying the original base model of an LLM is challenging, because fine-tuning or continued pretraining may alter its parameters. In this study, we introduce HuRef, a human-readable fingerprint for LLMs that uniquely identifies the base model without exposing model parameters or interfering with training. We first observe that the vector direction of LLM parameters remains stable once the model has converged during pretraining, showing negligible perturbations through subsequent training steps, including continued pretraining, supervised fine-tuning (SFT), and RLHF; this makes the direction a sufficient condition for identifying the base model. We validate its necessity by continuing to train an LLM with an extra term that drives the parameter direction away from the original, which leaves the model damaged. However, this direction is vulnerable to simple attacks such as dimension permutation or matrix rotation, which change it significantly without affecting performance. To address this, we leverage the Transformer structure to systematically analyze potential attacks and define three invariant terms that identify an LLM's base model. We make these invariant terms human-readable by mapping them to a Gaussian vector with a convolutional encoder and then converting that vector into a natural image with StyleGAN2. Our method generates a dog image as an identity fingerprint for an LLM, where the dog's appearance strongly indicates the LLM's base model. Experimental results across various LLMs demonstrate the effectiveness of our method: the generated dog image remains invariant to different training steps, including SFT, RLHF, and even continued pretraining with an augmented vocabulary in a new language.
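The interplay between parameter direction, rotation attacks, and invariant terms can be illustrated with a small numerical sketch. Everything here is a toy assumption rather than the paper's procedure: random matrices `W_q` and `W_k` stand in for a query and a key projection, fine-tuning is modeled as small additive noise, and the product `W_q @ W_k.T` is used only as an example of a quantity that an orthogonal rotation cannot change. The abstract does not state the three invariant terms, so this product should not be read as one of them.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two arrays, each flattened to a vector."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
d = 64

# Toy stand-ins for a base model's query/key projections (assumption).
W_q = rng.standard_normal((d, d))
W_k = rng.standard_normal((d, d))

# "Fine-tuning" modeled as a small additive perturbation (assumption).
W_q_ft = W_q + 0.01 * rng.standard_normal((d, d))
W_k_ft = W_k + 0.01 * rng.standard_normal((d, d))

# Attack: apply the same orthogonal matrix R to both projections.
# Attention scores depend only on W_q @ W_k.T, and
# (W_q R)(W_k R)^T = W_q R R^T W_k^T = W_q W_k^T, so behavior is unchanged.
# A dimension permutation is the special case where R is a permutation matrix.
R, _ = np.linalg.qr(rng.standard_normal((d, d)))
W_q_atk = W_q_ft @ R
W_k_atk = W_k_ft @ R

direction_ft = cosine(W_q, W_q_ft)    # direction survives the perturbation
direction_atk = cosine(W_q, W_q_atk)  # direction is destroyed by the rotation
invariant_atk = cosine(W_q @ W_k.T, W_q_atk @ W_k_atk.T)  # product survives both

print(f"direction after fine-tuning: {direction_ft:.4f}")
print(f"direction after rotation:    {direction_atk:.4f}")
print(f"invariant after rotation:    {invariant_atk:.4f}")
```

In this toy setting the raw direction stays close to 1 after the perturbation and drops toward 0 after the rotation, while the rotation-invariant product stays close to 1 in both cases. This is the property that motivates comparing invariant terms rather than raw parameters.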

