The protection of Intellectual Property (IP) for Large Language Models (LLMs) has become a critical concern as model theft and unauthorized commercialization escalate. While adversarial fingerprinting offers a promising black-box solution for ownership verification, existing methods suffer from significant limitations: they are fragile against model modifications, sensitive to system prompt variations, and easily detectable due to high-perplexity input patterns. In this paper, we propose SRAF, an adversarial fingerprinting framework that employs a multi-task optimization strategy, jointly optimizing fingerprints across homologous model variants and diverse chat templates so that the fingerprint anchors onto invariant decision-boundary features. Furthermore, we introduce a Perplexity Hiding technique that embeds adversarial perturbations within Markdown tables, aligning the prompt's token statistics with natural language to evade perplexity-based detection. Experiments on Llama-2 variants demonstrate SRAF's superior robustness and stealthiness compared to state-of-the-art baselines, offering a practical black-box solution for ownership verification.