As LLMs become commonplace, machine-generated text has the potential to flood the internet with spam, social media bots, and valueless content. Watermarking is a simple and effective strategy for mitigating such harms by enabling the detection and documentation of LLM-generated text. Yet a crucial question remains: how reliable is watermarking in realistic settings? In the wild, watermarked text may be modified to suit a user's needs, or entirely rewritten to avoid detection. We study the robustness of watermarked text after it is rewritten by humans, paraphrased by a non-watermarked LLM, or mixed into a longer hand-written document. We find that watermarks remain detectable even after human and machine paraphrasing. While these attacks dilute the strength of the watermark, paraphrases are statistically likely to leak n-grams or even longer fragments of the original text, resulting in high-confidence detections when enough tokens are observed. For example, after strong human paraphrasing, the watermark is detectable after observing 800 tokens on average at a false positive rate of 1e-5. We also consider a range of new detection schemes that are sensitive to short spans of watermarked text embedded inside a large document, and we compare the robustness of watermarking to other kinds of detectors.