Watermarking of large language model (LLM) generation embeds an imperceptible statistical pattern within generated text, making it algorithmically detectable. Watermarking is a promising method for addressing potential harm and biases from LLMs, as it enables traceability, accountability, and detection of manipulated content, helping to mitigate unintended consequences. However, for open-source models, watermarking faces two major challenges: (i) incompatibility with fine-tuned models, and (ii) vulnerability to fine-tuning attacks. In this work, we propose WAPITI, a new method that transfers watermarking from base models to fine-tuned models through parameter integration. To the best of our knowledge, this is the first watermark for fine-tuned open-source LLMs that preserves their fine-tuned capabilities. Furthermore, our approach offers an effective defense against fine-tuning attacks. We test our method on various model architectures and watermarking strategies. Results demonstrate that our method can successfully inject watermarks and is highly compatible with fine-tuned models. Additionally, we offer an in-depth analysis of how parameter editing influences the watermark strength and overall capabilities of the resulting models.
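The parameter-integration idea above can be sketched as a task-arithmetic-style operation: compute the parameter delta that watermarking introduces into the base model, then add that delta to a fine-tuned model's weights. This is a minimal illustrative sketch under that assumption; the function names, the scaling factor `alpha`, and the plain-dict parameter representation are hypothetical, and the actual WAPITI procedure may differ in detail.

```python
# Hypothetical sketch of watermark transfer via parameter integration.
# Assumption: watermarking the base model yields a per-parameter delta that
# can be added (optionally scaled) to a fine-tuned model's parameters.

def watermark_delta(base_params, watermarked_base_params):
    """Per-parameter difference introduced by watermarking the base model."""
    return {name: watermarked_base_params[name] - base_params[name]
            for name in base_params}

def transfer_watermark(finetuned_params, delta, alpha=1.0):
    """Add the (scaled) watermark delta to the fine-tuned model's weights."""
    return {name: finetuned_params[name] + alpha * delta[name]
            for name in finetuned_params}

# Toy usage with scalar "parameters" standing in for weight tensors:
base = {"w": 1.0}
watermarked = {"w": 1.5}
finetuned = {"w": 2.0}
delta = watermark_delta(base, watermarked)          # {"w": 0.5}
result = transfer_watermark(finetuned, delta)        # {"w": 2.5}
```

A scaling factor such as `alpha` is a natural knob here: the abstract's analysis of how parameter editing trades off watermark strength against model capability corresponds to tuning how strongly the delta is applied.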