Large language models (LLMs) are now deployed for everyday use and are poised to produce large quantities of text in the coming decade. Machine-generated text may displace human-written text on the internet and has the potential to be used for malicious purposes, such as spearphishing attacks and social media bots. Watermarking is a simple and effective strategy for mitigating such harms by enabling the detection and documentation of LLM-generated text. Yet a crucial question remains: how reliable is watermarking in realistic settings in the wild? There, watermarked text might be mixed with other text sources, paraphrased by human writers or other language models, and used for applications in a broad range of domains, both social and technical. In this paper, we explore different detection schemes, quantify their power at detecting watermarks, and determine how much machine-generated text needs to be observed in each scenario to reliably detect the watermark. We especially highlight our human study, where we investigate the reliability of watermarking when faced with human paraphrasing. We compare watermark-based detection to other detection strategies, finding overall that watermarking is a reliable solution, especially because of its sample complexity: for all attacks we consider, the watermark evidence compounds as more examples are given, and the watermark is eventually detected.
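To make the sample-complexity claim concrete, the following is a minimal sketch of how detection evidence can grow with the amount of observed text. It assumes a green-list style watermark scored with a one-proportion z-test; the abstract does not specify the scheme, so the function names, the green-list fraction `gamma`, and the threshold `z_threshold` are all illustrative assumptions rather than the paper's settings.

```python
import math


def watermark_z_score(green_count: int, total_tokens: int, gamma: float = 0.25) -> float:
    """z-statistic for the observed number of 'green' tokens.

    Assumption: under the null hypothesis (no watermark), each token falls
    in the green list independently with probability gamma.
    """
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    return (green_count - expected) / std


def tokens_needed(green_rate: float, gamma: float = 0.25, z_threshold: float = 4.0) -> int:
    """Smallest token count at which the expected z-score reaches z_threshold.

    green_rate is the (hypothetical) fraction of green tokens that survives
    an attack such as paraphrasing or mixing with unwatermarked text.
    """
    if green_rate <= gamma:
        raise ValueError("no detectable signal: green_rate must exceed gamma")
    # Solve (green_rate - gamma) * sqrt(T) / sqrt(gamma * (1 - gamma)) = z_threshold for T.
    t = (z_threshold ** 2) * gamma * (1.0 - gamma) / (green_rate - gamma) ** 2
    return math.ceil(t)


if __name__ == "__main__":
    # Same green rate (50%), more tokens: the z-score grows like sqrt(T).
    for n in (50, 100, 400):
        print(n, round(watermark_z_score(n // 2, n), 2))
    # A weaker surviving signal needs more text, but is still eventually detectable.
    for rate in (0.5, 0.35, 0.3):
        print(rate, tokens_needed(rate))
```

Under these assumptions the z-score scales with the square root of the number of tokens, so an attack that dilutes the green-token rate without pushing it down to `gamma` raises the amount of text required for detection but does not prevent it.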