As LLMs become commonplace, machine-generated text has the potential to flood the internet with spam, social media bots, and valueless content. Watermarking is a simple and effective strategy for mitigating such harms by enabling the detection and documentation of LLM-generated text. Yet a crucial question remains: how reliable is watermarking in realistic settings? There, watermarked text may be modified to suit a user's needs, or entirely rewritten to avoid detection. We study the robustness of watermarked text after it is rewritten by humans, paraphrased by a non-watermarked LLM, or mixed into a longer hand-written document. We find that watermarks remain detectable even after human and machine paraphrasing. While these attacks dilute the strength of the watermark, paraphrases are statistically likely to leak n-grams or even longer fragments of the original text, resulting in high-confidence detections when enough tokens are observed. For example, after strong human paraphrasing, the watermark is detectable after observing 800 tokens on average at a false positive rate of 1e-5. We also consider a range of new detection schemes that are sensitive to short spans of watermarked text embedded inside a large document, and we compare the robustness of watermarking to other kinds of detectors.
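To make the link between a false positive rate and the number of observed tokens concrete, the sketch below shows one way such a detection decision could be computed. The abstract does not specify the test statistic, so this assumes a one-proportion z-test over "green-list" tokens; the function names, the green-list fraction `gamma = 0.25`, and the example counts are hypothetical and are not taken from the paper.

```python
import math
from statistics import NormalDist


def z_threshold(fpr):
    """One-sided z-score cutoff for a target false positive rate."""
    return NormalDist().inv_cdf(1.0 - fpr)


def z_score(green_count, total, gamma):
    """z-statistic for seeing green_count green tokens out of total.

    Assumption: without a watermark, each token is green with
    probability gamma, independently of the others.
    """
    expected = gamma * total
    std = math.sqrt(total * gamma * (1.0 - gamma))
    return (green_count - expected) / std


def is_watermarked(green_count, total, gamma=0.25, fpr=1e-5):
    """Flag the text when the z-score reaches the cutoff for fpr."""
    return z_score(green_count, total, gamma) >= z_threshold(fpr)


def min_tokens(green_rate, gamma=0.25, fpr=1e-5):
    """Smallest token count at which the expected z-score reaches the cutoff.

    green_rate is the (hypothetical) fraction of green tokens that
    survives paraphrasing; it must exceed gamma.
    """
    if green_rate <= gamma:
        raise ValueError("green_rate must exceed gamma for detection")
    root = z_threshold(fpr) * math.sqrt(gamma * (1.0 - gamma)) / (green_rate - gamma)
    return math.ceil(root ** 2)


if __name__ == "__main__":
    print(z_threshold(1e-5))
    print(is_watermarked(green_count=260, total=800))
    print(min_tokens(green_rate=0.32))
```

Under these assumptions, the expected z-score grows with the square root of the token count, so a paraphrase that pulls the green-token rate closer to `gamma` can still be detected, but only after more tokens have been observed.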