Large language models (LLMs) are often modified after release through post-processing such as post-training or quantization, which makes it hard to determine whether one model is derived from another. Existing provenance detection methods fall short in one of two ways: (1) they embed signals into the base model before release, which is infeasible for already-published models, or (2) they compare outputs across models using hand-crafted or random prompts, which are not robust to post-processing. In this work, we propose LLMPrint, a novel detection framework that constructs fingerprints by exploiting LLMs' inherent vulnerability to prompt injection. Our key insight is that optimizing fingerprint prompts to enforce consistent token preferences yields fingerprints that are both unique to the base model and robust to post-processing. We further develop a unified verification procedure, with statistical guarantees, that applies to both gray-box and black-box settings. We evaluate LLMPrint on five base models and around 700 post-trained or quantized variants. Our results show that LLMPrint achieves high true positive rates while keeping false positive rates near zero. The code is publicly available at https://github.com/hifi-hyp/ACL-LLMPrint.