The most effective techniques to detect LLM-generated text rely on inserting a detectable signature -- or watermark -- during the model's decoding process. Most existing watermarking methods require access to the underlying LLM's logits, which LLM API providers are loath to share due to fears of model distillation. As such, these watermarks must be implemented independently by each LLM provider. In this paper, we develop PostMark, a modular post-hoc watermarking procedure in which an input-dependent set of words (determined via a semantic embedding) is inserted into the text after the decoding process has completed. Critically, PostMark does not require logit access, which means it can be implemented by a third party. We also show that PostMark is more robust to paraphrasing attacks than existing watermarking methods: our experiments cover eight baseline algorithms, five base LLMs, and three datasets. Finally, we evaluate the impact of PostMark on text quality using both automated and human assessments, highlighting the trade-off between quality and robustness to paraphrasing. We release our code, outputs, and annotations at https://github.com/lilakk/PostMark.
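To make the post-hoc idea concrete, here is a minimal sketch of the pipeline the abstract describes: embed the finished text, use the embedding to pick an input-dependent set of words, and insert those words after decoding, with no logit access at any point. This is not the authors' implementation. The hashed bag-of-words `embed`, the tiny `VOCAB`, the naive append-style `insert_words`, and the presence-based `detect` rule are all hypothetical stand-ins chosen so that the example runs with only the standard library; the abstract does not specify any of these details.

```python
import hashlib
import math
import re

DIM = 32  # toy embedding size; equals the SHA-256 digest length (assumption)

# Hypothetical candidate vocabulary; a real system would use a far larger one.
VOCAB = ["harbor", "lantern", "granite", "meadow", "compass", "velvet",
         "orchard", "ember", "glacier", "tapestry", "summit", "willow"]


def _normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return vec if norm == 0 else [x / norm for x in vec]


def _hash_vector(token):
    """Deterministic pseudo-random unit vector (stand-in for a learned embedding)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return _normalize([b / 255 - 0.5 for b in digest])


def _tokens(text):
    return re.findall(r"[a-z]+", text.lower())


def embed(text):
    """Toy text embedding: normalized sum of per-token hash vectors."""
    total = [0.0] * DIM
    for tok in _tokens(text):
        total = [a + b for a, b in zip(total, _hash_vector(tok))]
    return _normalize(total)


def select_words(text, k=3):
    """Pick the k vocabulary words most similar to the text embedding."""
    query = embed(text)

    def similarity(word):
        return sum(a * b for a, b in zip(query, _hash_vector(word)))

    # Sort by descending similarity; ties are broken alphabetically.
    return sorted(VOCAB, key=lambda w: (-similarity(w), w))[:k]


def insert_words(text, words):
    """Naive insertion: append any missing words as a trailing note."""
    present = set(_tokens(text))
    missing = [w for w in words if w not in present]
    return text if not missing else f"{text} Keywords: {', '.join(missing)}."


def presence_score(text, words):
    """Fraction of the given words that occur in the text."""
    present = set(_tokens(text))
    return sum(w in present for w in words) / len(words)


def detect(text, k=3, threshold=0.7):
    """Re-derive the word list from the candidate text and check how much of it is present."""
    return presence_score(text, select_words(text, k)) >= threshold


if __name__ == "__main__":
    original = "The river flooded the valley after three days of rain."
    chosen = select_words(original)
    marked = insert_words(original, chosen)
    print(chosen)
    print(marked)
    print(detect(marked), detect(original))
```

Because nothing here touches the generating model, a third party could run the same steps on text returned by any API. Appending the words is the crudest possible insertion and would visibly degrade the text; how the words are woven in is where the trade-off between quality and robustness mentioned in the abstract would show up. The toy `detect` also re-derives its word list from the modified text, so its result depends on how stable the embedding is under insertion and paraphrasing.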