Given the high cost of training large language models (LLMs) from scratch, safeguarding LLM intellectual property (IP) has become increasingly crucial. As the standard paradigm for IP ownership verification, LLM fingerprinting plays a vital role in addressing this challenge. Existing LLM fingerprinting methods verify ownership by extracting or injecting model-specific features. However, they overlook potential attacks during the verification process, which renders them ineffective when a model thief fully controls the LLM's inference process. In such settings, attackers may share prompt-response pairs to enable fingerprint unlearning, or manipulate outputs to evade exact-match verification. We propose iSeal, the first fingerprinting method designed for reliable verification when the model thief controls the suspected LLM end to end. iSeal injects unique features into both the model and an external module, reinforced by an error-correction mechanism and a similarity-based verification strategy. These components resist verification-time attacks, including collusion-based fingerprint unlearning and response manipulation, as supported by both theoretical analysis and empirical results. iSeal achieves a 100% Fingerprint Success Rate (FSR) on 12 LLMs against more than 10 attacks, whereas baselines fail under unlearning and response manipulation.
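The difference between exact-match and similarity-based verification under response manipulation can be illustrated with a minimal Python sketch. It assumes a character-level similarity score from the standard library's `difflib.SequenceMatcher` and a threshold of 0.8; both are illustrative choices, not iSeal's actual similarity measure or threshold, which the abstract does not specify. The function names (`exact_match`, `similarity_match`, `fingerprint_success_rate`) and the fingerprint strings are hypothetical. FSR is computed here simply as the fraction of fingerprint probes that pass verification.

```python
from difflib import SequenceMatcher


def exact_match(expected: str, observed: str) -> bool:
    """Baseline check: any edit to the response breaks verification."""
    return expected == observed


def similarity_match(expected: str, observed: str, threshold: float = 0.8) -> bool:
    """Hypothetical similarity check: accept if the ratio reaches the threshold.

    SequenceMatcher.ratio() returns 2*M/T in [0, 1], where M is the number of
    matched characters and T is the combined length of both strings.
    """
    return SequenceMatcher(None, expected, observed).ratio() >= threshold


def fingerprint_success_rate(pairs, verify) -> float:
    """Fraction of (expected, observed) fingerprint pairs that pass `verify`."""
    if not pairs:
        return 0.0
    return sum(verify(e, o) for e, o in pairs) / len(pairs)


# Hypothetical fingerprint responses, lightly manipulated by an attacker
# (appended punctuation, changed capitalization) except for the last one.
pairs = [
    ("The seal is azure-7", "The seal is azure-7."),
    ("Key: violet lantern", "key: violet lantern"),
    ("Answer 4213", "Answer 4213"),
]

print(fingerprint_success_rate(pairs, exact_match))       # 0.3333333333333333
print(fingerprint_success_rate(pairs, similarity_match))  # 1.0
```

In this toy setting, small surface edits defeat the exact-match check on two of three probes, while the threshold-based check still accepts them; an unrelated response (for example, a refusal) would still fall well below the threshold.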