Watermarking is a technique used to embed a hidden signal in the probability distribution of text generated by large language models (LLMs), enabling attribution of the text to the originating model. We introduce smoothing attacks and show that existing watermarking methods are not robust against minor modifications of text. An adversary can use weaker language models to smooth out the distribution perturbations caused by watermarks without significantly compromising the quality of the generated text. The modified text resulting from the smoothing attack remains close to the distribution of text that the original model (without watermark) would have produced. Our attack reveals a fundamental limitation of a wide range of watermarking techniques.
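The core idea of a smoothing attack can be sketched in a toy next-token step. The following is a minimal illustration, not the paper's actual construction: the watermark is modeled as a logit bias on a set of favored ("green") tokens, the weaker model as a noisy copy of the original, and smoothing as an interpolation of the two logit vectors; all names, vocabulary sizes, and constants (`green`, `alpha`, the bias of 2.0) are illustrative assumptions.

```python
import numpy as np

V = 4  # toy vocabulary size (illustrative)

base_logits = np.array([1.0, 0.5, 0.0, -0.5])   # original model, no watermark
green = np.array([True, False, True, False])     # tokens the watermark favors (assumed)
wm_logits = base_logits + 2.0 * green            # watermark adds a logit bias
weak_logits = np.array([0.8, 0.6, -0.1, -0.4])   # weaker model: rough copy of base

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def tv(p, q):
    """Total-variation distance between two distributions."""
    return 0.5 * np.abs(p - q).sum()

p_base = softmax(base_logits)
p_wm = softmax(wm_logits)

# Smoothing attack: interpolate the watermarked model's logits with the
# weaker model's, damping the watermark bias at a small cost in quality.
alpha = 0.5
smoothed = softmax(alpha * wm_logits + (1 - alpha) * weak_logits)

print(tv(p_wm, p_base))      # distance of watermarked output from the original
print(tv(smoothed, p_base))  # smoothed output lands closer to the original
```

In this toy setting the smoothed distribution is measurably closer (in total variation) to the unwatermarked model's distribution than the watermarked one is, mirroring the abstract's claim that the attacked text stays near the original model's distribution.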