To protect the intellectual property of well-trained deep neural networks (DNNs), black-box DNN watermarks, which are embedded into the prediction behavior of DNN models on a set of specially crafted samples, have gained increasing popularity in both academia and industry. Such watermarks are typically designed to be robust against attackers who steal the protected model and obfuscate its parameters to remove the watermark. Recent studies empirically demonstrate the robustness of most black-box watermarking schemes against known removal attempts. In this paper, we propose a novel Model Inversion-based Removal Attack (\textsc{Mira}), which is watermark-agnostic and effective against most mainstream black-box DNN watermarking schemes. In general, our attack pipeline exploits the internals of the protected model to recover and unlearn the watermark message. We further design target class detection and recovered sample splitting algorithms to reduce the utility loss caused by \textsc{Mira} and achieve data-free watermark removal on half of the watermarking schemes. We conduct a comprehensive evaluation of \textsc{Mira} against ten mainstream black-box watermarks on three benchmark datasets and DNN architectures. Compared with six baseline removal attacks, \textsc{Mira} achieves strong watermark removal effects on the covered watermarks while preserving at least $90\%$ of the stolen model's utility, under more relaxed or even no assumptions on dataset availability.