Model watermarking protects the intellectual property of AI models by embedding distinctive knowledge that induces unique behavioral signatures. The main technical challenge is keeping the watermark robust to post-processing attacks on the watermarked model. Model extraction attacks pose the most severe threat: an adversary uses the model's prediction outputs to train a surrogate model that illegally replicates the original model's functionality. In this work, we propose a rehearsal-based watermark embedding framework that improves the robustness of model watermarks against model extraction attacks. Our method simulates the extraction process and uses the loss of a \textit{simulated stolen model} on a trigger set as a training signal to fine-tune the watermark knowledge within the target model. This fine-tuning step encourages the watermark to be embedded in a form that transfers more readily, increasing its chances of persisting and remaining detectable in stolen models. Experiments under diverse settings show that the proposed method significantly improves the robustness of model watermarks against both model extraction and subsequent watermark removal attacks.
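The rehearsal idea can be illustrated with a toy sketch. The following is not the paper's implementation: the abstract does not specify the model, the extraction procedure, or how the training signal is propagated back to the target. This sketch assumes a three-parameter logistic model, a distillation-style simulated extraction on soft labels, and central finite differences to estimate the gradient of the stolen model's trigger loss with respect to the target weights. All names (`simulate_extraction`, `stolen_trigger_loss`, `rehearsal_finetune`, `QUERIES`, `TRIGGERS`, `TARGET_W`) are hypothetical, and any term that preserves accuracy on the main task is omitted for brevity.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Toy logistic model: w = [w0, w1, bias], x = (x0, x1)."""
    return sigmoid(w[0] * x[0] + w[1] * x[1] + w[2])

def bce(p, y, eps=1e-12):
    """Binary cross-entropy between probability p and label y."""
    return -(y * math.log(p + eps) + (1.0 - y) * math.log(1.0 - p + eps))

def simulate_extraction(target_w, queries, steps=50, lr=0.5):
    """Rehearse the attack: fit a fresh surrogate to the target's soft outputs."""
    soft_labels = [predict(target_w, x) for x in queries]
    s = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for x, t in zip(queries, soft_labels):
            err = predict(s, x) - t  # d(BCE)/d(logit) for a soft label t
            grad[0] += err * x[0]
            grad[1] += err * x[1]
            grad[2] += err
        s = [si - lr * gi / len(queries) for si, gi in zip(s, grad)]
    return s

def stolen_trigger_loss(target_w, queries, triggers):
    """Loss of the simulated stolen model on the trigger set."""
    stolen = simulate_extraction(target_w, queries)
    return sum(bce(predict(stolen, x), y) for x, y in triggers) / len(triggers)

def rehearsal_finetune(target_w, queries, triggers, rounds=20, lr=0.2, h=1e-4):
    """Fine-tune the target using the stolen model's trigger loss as the signal.

    Assumption: gradients are estimated by central finite differences, which
    is only practical because this toy target has three parameters.
    """
    w = list(target_w)
    for _ in range(rounds):
        grad = []
        for i in range(len(w)):
            up = list(w)
            up[i] += h
            down = list(w)
            down[i] -= h
            diff = (stolen_trigger_loss(up, queries, triggers)
                    - stolen_trigger_loss(down, queries, triggers))
            grad.append(diff / (2 * h))
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

# Hypothetical toy data: a 3x3 grid of attacker queries and one trigger input.
QUERIES = [(a, b) for a in (-1.0, 0.0, 1.0) for b in (-1.0, 0.0, 1.0)]
TRIGGERS = [((2.0, -2.0), 1.0)]
TARGET_W = [1.0, 1.0, 0.0]

if __name__ == "__main__":
    before = stolen_trigger_loss(TARGET_W, QUERIES, TRIGGERS)
    tuned_w = rehearsal_finetune(TARGET_W, QUERIES, TRIGGERS)
    after = stolen_trigger_loss(tuned_w, QUERIES, TRIGGERS)
    print("stolen-model trigger loss before:", before)
    print("stolen-model trigger loss after: ", after)
```

In this sketch the target never sees the trigger labels directly: its weights move only in directions that make a freshly extracted surrogate respond correctly to the trigger input, which is the sense in which the watermark is tuned for transferability. A realistic setting would replace the finite-difference step with a differentiable or approximate gradient through the simulated extraction.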