Protecting the intellectual property of large language models (LLMs) is a critical challenge due to the proliferation of unauthorized derivative models. We introduce a novel fingerprinting framework that leverages the behavioral patterns induced by safety alignment, applying the concept of refusal vectors to LLM provenance tracking. These vectors, extracted from directional patterns in a model's internal representations when processing harmful versus harmless prompts, serve as robust behavioral fingerprints. Our contribution lies in developing a fingerprinting system around this concept and extensively validating its effectiveness for IP protection. We demonstrate that these behavioral fingerprints are highly robust against common modifications, including fine-tuning, merging, and quantization. Our experiments show that the fingerprint is unique to each model family, with low cosine similarity between independently trained models. In a large-scale identification task across 76 offspring models, our method achieves 100\% accuracy in identifying the correct base model family. Furthermore, we analyze the fingerprint's behavior under alignment-breaking attacks, finding that although performance degrades significantly, detectable traces remain. Finally, we propose a theoretical framework that transforms this private fingerprint into a publicly verifiable, privacy-preserving artifact using locality-sensitive hashing and zero-knowledge proofs.
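To make the extraction step concrete, the sketch below illustrates one standard way to obtain a difference-in-means refusal direction and compare fingerprints by cosine similarity. It assumes a HuggingFace causal LM; the layer index, last-token readout, and function names are illustrative assumptions, not the paper's exact settings.

```python
# Hedged sketch: difference-in-means refusal-vector extraction and
# cosine-similarity fingerprint comparison. Layer choice and last-token
# readout are illustrative assumptions, not the paper's settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def extract_refusal_vector(model_name, harmful_prompts, harmless_prompts, layer=-1):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    def mean_hidden(prompts):
        acts = []
        with torch.no_grad():
            for p in prompts:
                ids = tok(p, return_tensors="pt")
                out = model(**ids, output_hidden_states=True)
                # hidden state of the final token at the chosen layer
                acts.append(out.hidden_states[layer][0, -1])
        return torch.stack(acts).mean(dim=0)

    # directional difference between harmful and harmless activations
    v = mean_hidden(harmful_prompts) - mean_hidden(harmless_prompts)
    return v / v.norm()

def fingerprint_similarity(v_base, v_suspect):
    # high cosine similarity suggests the suspect model shares the
    # base model's family; both vectors are unit-normalized above
    return torch.dot(v_base, v_suspect).item()
```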
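The zero-knowledge component is beyond a short sketch, but the locality-sensitive hashing step can be illustrated with random-hyperplane hashing (SimHash), whose bit agreement provably tracks the cosine similarity of the underlying private vectors; the bit count and seed below are illustrative assumptions.

```python
# Hedged sketch of the LSH step: project the private refusal vector onto
# random hyperplanes to get a public bit string. Parameters are illustrative.
import torch

def simhash(v, n_bits=256, seed=0):
    # shared public randomness so prover and verifier derive the same hyperplanes
    g = torch.Generator().manual_seed(seed)
    planes = torch.randn(n_bits, v.shape[0], generator=g)
    return planes @ v > 0  # one sign bit per random hyperplane

def hamming_similarity(h1, h2):
    # for random hyperplanes, P[bits agree] = 1 - angle(v1, v2) / pi,
    # so the fraction of matching bits tracks cosine similarity
    return (h1 == h2).float().mean().item()
```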