We study watermarking schemes for language models with provable guarantees. As we show, prior works offer no robustness guarantees against adaptive prompting, in which a user queries a language model more than once, as even benign users do. With a single exception (Christ and Gunn, 2024), prior works are also restricted to zero-bit watermarking: machine-generated text can be detected as such, but no additional information can be extracted from the watermark. Unfortunately, merely detecting AI-generated text may not prevent future abuses. We introduce multi-user watermarks, which allow tracing model-generated text to individual users or to groups of colluding users, even in the face of adaptive prompting. We construct multi-user watermarking schemes from undetectable, adaptively robust, zero-bit watermarking schemes (and prove that the undetectable zero-bit scheme of Christ, Gunn, and Zamir (2024) is adaptively robust). Importantly, our scheme provides both zero-bit and multi-user assurances at the same time: it detects shorter snippets just as well as the original scheme, and it traces longer excerpts to individuals. The main technical component is a construction of message-embedding watermarks from zero-bit watermarks. Ours is the first generic reduction between watermarking schemes for language models. A challenge for such reductions is the lack of a unified abstraction for robustness, the property that marked text remains detectable even after edits. We introduce a new unifying abstraction called AEB-robustness, which guarantees that the watermark is detectable whenever the edited text "approximates enough blocks" of model-generated output.