Protecting the intellectual property of open-weight large language models (LLMs) requires verifying whether a suspect model is derived from a victim model, even after common laundering operations such as fine-tuning (including PPO/DPO), pruning/compression, and model merging. We propose \textsc{AttnDiff}, a data-efficient white-box framework that fingerprints a model through its intrinsic information-routing behavior. \textsc{AttnDiff} probes the model with minimally edited prompt pairs that induce controlled semantic conflicts, captures the resulting differential attention patterns, summarizes them with compact spectral descriptors, and compares models using centered kernel alignment (CKA). Across Llama-2/3 and Qwen2.5 (3B--14B) and additional open-source families, it yields high similarity for related derivatives while separating unrelated model families (e.g., $>0.98$ vs.\ $<0.22$ with $M=60$ probes). Requiring only 5--60 multi-domain probes, it supports practical provenance verification and accountability.
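The last two stages of the pipeline (spectral summarization and CKA comparison) can be illustrated with a minimal sketch. The abstract does not specify the descriptor or the CKA variant, so the code below makes two assumptions: the spectral descriptor is taken to be the top-$k$ singular values of the attention difference between the two prompts of a pair, and the comparison uses linear CKA. The function names `spectral_descriptor` and `linear_cka` are hypothetical and are not taken from the paper.

```python
import numpy as np


def spectral_descriptor(attn_original, attn_edited, k=4):
    """Top-k singular values of a differential attention map.

    Assumption: the descriptor is the leading singular spectrum of
    (attn_edited - attn_original); the paper may define it differently.
    Both inputs are 2-D attention matrices of the same shape.
    """
    diff = np.asarray(attn_edited) - np.asarray(attn_original)
    # compute_uv=False returns singular values only, in descending order.
    singular_values = np.linalg.svd(diff, compute_uv=False)
    return singular_values[:k]


def linear_cka(X, Y):
    """Linear centered kernel alignment between two feature matrices.

    X has shape (M, d1) and Y has shape (M, d2), where each of the M rows
    holds the descriptor obtained from one probe. The score lies in [0, 1]
    and is invariant to orthogonal transforms and isotropic scaling.
    """
    # Center every feature column across the M probes.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(X.T @ Y, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (norm_x * norm_y)
```

In this sketch, each model would contribute an $M \times d$ matrix whose rows are the descriptors of the $M$ probes, and `linear_cka` applied to the two matrices gives the similarity score. Linear CKA is chosen here only because it is cheap and has no kernel bandwidth to tune.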