Text watermarking for Large Language Models (LLMs) has made significant progress in detecting LLM outputs and preventing misuse. Current watermarking techniques offer high detectability, minimal impact on text quality, and robustness to text editing. However, existing research has not investigated the imperceptibility of watermarking techniques in LLM services. This matters because LLM providers may not want to disclose the presence of watermarks in real-world scenarios: disclosure could reduce users' willingness to use the service and make watermarks more vulnerable to attacks. This work is the first to investigate the imperceptibility of watermarked LLMs. We design an identification algorithm, Water-Probe, that detects watermarks through well-designed prompts to the LLM. Our key motivation is that current watermarked LLMs expose consistent biases under the same watermark key, so the differences between outputs generated under different watermark keys are similar across prompts. Experiments show that almost all mainstream watermarking algorithms are easily identified with our well-designed prompts, while Water-Probe maintains a minimal false positive rate on non-watermarked LLMs. Finally, we propose that the key to enhancing the imperceptibility of watermarked LLMs is to increase the randomness of watermark key selection. Based on this, we introduce the Water-Bag strategy, which significantly improves watermark imperceptibility by merging multiple watermark keys.
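The intuition behind the probing idea can be illustrated with a small simulation. The sketch below is not the Water-Probe algorithm itself. It assumes a toy vocabulary, a hypothetical green-list style watermark in which the key selects half of the vocabulary and boosts its logits by a constant, and two prompts with similar output distributions. All names and parameters (`VOCAB`, `DELTA`, `green_list`, `probe_score`, the prompt ids) are illustrative assumptions. For each prompt it estimates the token distribution under two keys, takes the difference, and compares the two difference vectors by cosine similarity. Under these assumptions a watermarked model should give clearly correlated differences, while an unwatermarked model should give only sampling noise.

```python
import math
import random

VOCAB = 200   # toy vocabulary size (assumption)
DELTA = 2.0   # logit boost applied to "green" tokens (assumption)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def green_list(key):
    """Pseudo-random half of the vocabulary, fully determined by the key."""
    return set(random.Random(key).sample(range(VOCAB), VOCAB // 2))


def next_token_probs(prompt_id, key=None):
    """Toy next-token distribution. Prompts share base logits, so their
    outputs are similar; key=None means no watermark is applied."""
    base = random.Random(0)
    noise = random.Random(prompt_id)
    logits = [base.gauss(0, 0.5) + 0.3 * noise.gauss(0, 1) for _ in range(VOCAB)]
    if key is not None:
        green = green_list(key)
        logits = [x + DELTA if i in green else x for i, x in enumerate(logits)]
    return softmax(logits)


def empirical(probs, n, rng):
    """Estimate a distribution from n sampled tokens, as a prober would."""
    counts = [0] * VOCAB
    for tok in rng.choices(range(VOCAB), weights=probs, k=n):
        counts[tok] += 1
    return [c / n for c in counts]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def probe_score(key_a, key_b, n=10000, seed=0):
    """Cosine similarity between the key-to-key output differences
    measured on two different prompts."""
    rng = random.Random(seed)
    diffs = []
    for prompt_id in (101, 202):
        pa = empirical(next_token_probs(prompt_id, key_a), n, rng)
        pb = empirical(next_token_probs(prompt_id, key_b), n, rng)
        diffs.append([x - y for x, y in zip(pa, pb)])
    return cosine(diffs[0], diffs[1])


watermarked_score = probe_score(key_a=1, key_b=2)
plain_score = probe_score(key_a=None, key_b=None)
print(f"watermarked: {watermarked_score:.2f}, unwatermarked: {plain_score:.2f}")
```

In this toy setting the prober never sees the keys. It only needs the keys to stay fixed within each group of samples, which is why making key selection more random would weaken the signal.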