This paper presents a novel framework for watermarking language models through prompts that are themselves generated by language models. The approach uses a multi-model setup: a Prompting language model generates watermarking instructions, a Marking language model embeds watermarks within generated content, and a Detecting language model verifies the presence of these watermarks. Experiments are conducted using ChatGPT and Mistral as the Prompting and Marking language models, with detection accuracy evaluated using a pretrained classifier model. Results demonstrate that the proposed framework achieves high classification accuracy across various configurations, with 95% accuracy for ChatGPT and 88.79% for Mistral. These findings validate the effectiveness and adaptability of the proposed watermarking strategy across different language model architectures. Hence, the proposed framework holds promise for applications in content attribution, copyright protection, and model authentication.