The most effective techniques to detect LLM-generated text rely on inserting a detectable signature -- or watermark -- during the model's decoding process. Most existing watermarking methods require access to the underlying LLM's logits, which LLM API providers are loath to share due to fears of model distillation. As such, these watermarks must be implemented independently by each LLM provider. In this paper, we develop PostMark, a modular post-hoc watermarking procedure in which an input-dependent set of words (determined via a semantic embedding) is inserted into the text after the decoding process has completed. Critically, PostMark does not require logit access, which means it can be implemented by a third party. We also show that PostMark is more robust to paraphrasing attacks than existing watermarking methods: our experiments cover eight baseline algorithms, five base LLMs, and three datasets. Finally, we evaluate the impact of PostMark on text quality using both automated and human assessments, highlighting the trade-off between quality and robustness to paraphrasing. We release our code, outputs, and annotations at https://github.com/lilakk/PostMark.
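The core pipeline described above (embed the input, pick an input-dependent word set, insert the words post hoc, and detect by checking for their presence) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it substitutes a hash-based stand-in for the semantic embedding model, a trivial append in place of the LLM-driven fluent insertion, and hypothetical names (`select_words`, `detect`, `thresh`) throughout.

```python
import hashlib
import numpy as np

DIM = 64  # toy embedding dimensionality (assumption)

def word_vec(word: str) -> np.ndarray:
    """Deterministic stand-in for a word embedding. Real PostMark uses a
    trained semantic embedder; hashing is only for a self-contained demo."""
    seed = int.from_bytes(hashlib.sha256(word.lower().encode()).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

def embed_text(text: str) -> np.ndarray:
    """Mean-of-word-vectors text embedding, L2-normalized."""
    v = np.mean([word_vec(w) for w in text.lower().split()], axis=0)
    return v / np.linalg.norm(v)

def select_words(text: str, vocab: list[str], k: int = 3) -> list[str]:
    """Input-dependent watermark words: the k vocabulary entries whose
    embeddings are most similar (by dot product) to the text embedding."""
    t = embed_text(text)
    return sorted(vocab, key=lambda w: -float(word_vec(w) @ t))[:k]

def insert_words(text: str, words: list[str]) -> str:
    """Naive insertion by appending; the paper instead uses an LLM to
    weave the words into the text fluently."""
    return text + " " + " ".join(words)

def detect(text: str, vocab: list[str], k: int = 3, thresh: float = 0.6) -> bool:
    """Re-derive the expected word set from the candidate text and flag it
    as watermarked if enough of those words are present."""
    expected = select_words(text, vocab, k)
    present = sum(w in text.lower().split() for w in expected)
    return present / k >= thresh
```

Because both insertion and detection derive the word set from the text itself, no logit access (and no cooperation from the LLM provider) is needed; robustness to paraphrasing comes from the semantic embedding, which the hash-based toy above does not capture.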