Large language models (LLMs) demonstrate general intelligence across a variety of machine learning tasks, thereby enhancing the commercial value of their intellectual property (IP). To protect this IP, model owners typically allow user access only in a black-box manner; however, adversaries can still use model extraction attacks to steal the model intelligence encoded in the generated output. Watermarking offers a promising defense against such attacks by embedding unique identifiers into model-generated content. However, existing watermarking methods often compromise the quality of generated content because they rely on heuristic alterations, and they lack robust mechanisms to counteract adversarial strategies, which limits their practicality in real-world scenarios. In this paper, we introduce an adaptive and robust watermarking method, named ModelShield, to protect the IP of LLMs. Our method incorporates a self-watermarking mechanism that allows LLMs to autonomously insert watermarks into their generated content, avoiding degradation of content quality. We also propose a robust watermark detection mechanism that effectively identifies watermark signals under interference from varying adversarial strategies. In addition, ModelShield is a plug-and-play method that requires no additional model training, which enhances its applicability in LLM deployments. Extensive evaluations on two real-world datasets and three LLMs demonstrate that our method surpasses existing methods in defense effectiveness and robustness while significantly reducing the quality degradation that watermarking imposes on model-generated content.