We propose a methodology for planting watermarks in text generated by an autoregressive language model. The watermarks are robust to perturbations and, up to a certain maximum generation budget, do not change the distribution over text. We generate watermarked text by mapping a sequence of random numbers -- which we compute using a randomized watermark key -- to a sample from the language model. To detect watermarked text, any party who knows the key can align the text to the random number sequence. We instantiate our watermark methodology with two sampling schemes: inverse transform sampling and exponential minimum sampling. We apply these watermarks to three language models -- OPT-1.3B, LLaMA-7B and Alpaca-7B -- to experimentally validate their statistical power and robustness to various paraphrasing attacks. Notably, for both the OPT-1.3B and LLaMA-7B models, we find we can reliably detect watermarked text ($p \leq 0.01$) from $35$ tokens even after corrupting between $40$\% and $50$\% of the tokens via random edits (i.e., substitutions, insertions or deletions). For the Alpaca-7B model, we conduct a case study on the feasibility of watermarking responses to typical user instructions. Due to the lower entropy of the responses, detection is more difficult: around $25\%$ of the responses -- whose median length is around $100$ tokens -- are detectable with $p \leq 0.01$, and the watermark is also less robust to certain automated paraphrasing attacks we implement.
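To make the two sampling schemes concrete, the following is a minimal sketch of how key-derived random numbers can be mapped to a token without changing the sampling distribution. It is an illustration under stated assumptions, not the authors' implementation: the function names (`key_to_uniforms`, `exp_min_sample`, `inverse_transform_sample`), the use of an integer key to seed a NumPy generator, and the toy three-token distribution are all hypothetical. The alignment-based detection step is not shown.

```python
import numpy as np


def key_to_uniforms(key: int, length: int, vocab_size: int) -> np.ndarray:
    """Derive a (length, vocab_size) array of uniforms in [0, 1) from a key.

    Assumption: the watermark key is an integer used as a PRNG seed.
    """
    rng = np.random.default_rng(key)
    return rng.random((length, vocab_size))


def exp_min_sample(probs: np.ndarray, xi: np.ndarray) -> int:
    """Exponential minimum sampling for one position.

    -log(xi_i) / p_i is exponentially distributed with rate p_i, and the
    minimum of independent exponentials falls on index i with probability
    p_i, so the token distribution is unchanged.
    """
    probs = np.asarray(probs, dtype=float)
    with np.errstate(divide="ignore"):
        scores = -np.log(xi) / probs  # tokens with p_i = 0 score +inf
    return int(np.argmin(scores))


def inverse_transform_sample(probs: np.ndarray, u: float, perm: np.ndarray) -> int:
    """Inverse transform sampling for one position.

    Tokens are laid out on [0, 1) in the order given by `perm`; the token
    whose CDF interval contains `u` is returned.
    """
    probs = np.asarray(probs, dtype=float)
    cdf = np.cumsum(probs[perm])
    idx = int(np.searchsorted(cdf, u, side="right"))
    idx = min(idx, len(perm) - 1)  # guard against floating-point round-off
    return int(perm[idx])


if __name__ == "__main__":
    probs = np.array([0.5, 0.3, 0.2])  # toy next-token distribution
    xi = key_to_uniforms(key=42, length=5, vocab_size=len(probs))
    print([exp_min_sample(probs, row) for row in xi])
```

Because both functions are deterministic given the random numbers, a party holding the same key can regenerate the sequence and check how well a candidate text lines up with it, while a party without the key sees ordinary samples from the model.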