Large language models (LLMs) demonstrate general intelligence across a variety of machine learning tasks, thereby enhancing the commercial value of their intellectual property (IP). To protect this IP, model owners typically allow user access only in a black-box manner; however, adversaries can still use model extraction attacks to steal the model intelligence encoded in the generated outputs. Watermarking offers a promising defense against such attacks by embedding unique identifiers into model-generated content. However, existing watermarking methods often compromise the quality of the generated content through heuristic alterations and lack robust mechanisms to counteract adversarial strategies, limiting their practicality in real-world scenarios. In this paper, we introduce an adaptive and robust watermarking method, named ModelShield, to protect the IP of LLMs. Our method incorporates a self-watermarking mechanism that allows LLMs to autonomously insert watermarks into their generated content, avoiding degradation of content quality. We also propose a robust watermark detection mechanism capable of effectively identifying watermark signals under the interference of varying adversarial strategies. Moreover, ModelShield is a plug-and-play method that requires no additional model training, enhancing its applicability in LLM deployments. Extensive evaluations on two real-world datasets and three LLMs demonstrate that our method surpasses existing methods in defense effectiveness and robustness while significantly reducing the degradation that watermarking causes in model-generated content.