As AI-generated text increasingly resembles human-written content, the ability to detect machine-generated text becomes crucial. To address this challenge, we present GPTWatermark, a robust and high-quality solution designed to ascertain whether a piece of text originates from a specific model. Our approach extends existing watermarking strategies and employs a fixed group design to enhance robustness against editing and paraphrasing attacks. We show that our watermarked language model enjoys strong provable guarantees on generation quality, correctness in detection, and security against evasion attacks. Experimental results on various large language models (LLMs) and diverse datasets demonstrate that our method achieves superior detection accuracy and comparable generation quality in perplexity, thus promoting the responsible use of LLMs.
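To make the "fixed group design" concrete, the following is a minimal sketch of one common way such a scheme can be realized; it is not taken from the abstract, which gives no algorithmic details. It assumes that the vocabulary is split once, using a secret key, into a fixed "green" subset and its complement, that generation adds a constant bias to green-token logits, and that detection counts green tokens and computes a one-sided z-score. The function names (`green_list`, `bias_logits`, `z_score`) and the parameters `gamma` (green fraction) and `delta` (logit bias) are hypothetical and chosen for illustration only.

```python
from __future__ import annotations

import hashlib
import math
import random


def green_list(vocab_size: int, key: str, gamma: float = 0.5) -> set[int]:
    """Return a fixed 'green' subset of token ids derived only from a secret key.

    Assumption: the split does not depend on preceding tokens, so it is the
    same at every position in the text.
    """
    # Derive a deterministic seed from the key so the split is reproducible.
    seed = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])


def bias_logits(logits: list[float], green: set[int], delta: float = 2.0) -> list[float]:
    """Add a constant bias `delta` to the logits of green tokens (generation side)."""
    return [x + delta if i in green else x for i, x in enumerate(logits)]


def z_score(token_ids: list[int], green: set[int], gamma: float = 0.5) -> float:
    """One-sided z-score of the green-token count (detection side).

    Under the null hypothesis (unwatermarked text), each token is assumed to be
    green with probability `gamma`, so the count has mean gamma * n and
    variance n * gamma * (1 - gamma).
    """
    n = len(token_ids)
    if n == 0:
        raise ValueError("cannot score an empty token sequence")
    hits = sum(1 for t in token_ids if t in green)
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```

A detector built this way would flag a text when `z_score` exceeds a chosen threshold. Because the green set is fixed, replacing or reordering a few tokens changes the count by at most the number of edited tokens, which is one intuition for robustness to editing under these assumptions.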