We propose a methodology for planting watermarks in text generated by an autoregressive language model. The watermarks are robust to perturbations and do not change the distribution over text up to a certain maximum generation budget. We generate watermarked text by mapping a sequence of random numbers, which we compute using a randomized watermark key, to a sample from the language model. To detect watermarked text, any party who knows the key can align the text to the random number sequence. We instantiate our watermark methodology with two sampling schemes: inverse transform sampling and exponential minimum sampling. We apply these watermarks to three language models (OPT-1.3B, LLaMA-7B and Alpaca-7B) to experimentally validate their statistical power and robustness to various paraphrasing attacks. Notably, for both the OPT-1.3B and LLaMA-7B models, we find we can reliably detect watermarked text ($p \leq 0.01$) from $35$ tokens even after corrupting $40$-$50\%$ of the tokens via random edits (i.e., substitutions, insertions or deletions). For the Alpaca-7B model, we conduct a case study on the feasibility of watermarking responses to typical user instructions. Because these responses have lower entropy, detection is more difficult: around $25\%$ of the responses, whose median length is around $100$ tokens, are detectable with $p \leq 0.01$, and the watermark is also less robust to certain automated paraphrasing attacks we implement.
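To make the key-to-sample mapping concrete, the following is a minimal sketch of exponential minimum sampling and a simplified detector, written under several assumptions that are not in the abstract. The watermark key is taken to be an integer seed, `toy_next_token_probs` is a hypothetical stand-in for a real language model, and the detector compares a per-token score against scores from freshly drawn keys. It omits the alignment step that the abstract credits for robustness to edits, so it illustrates only the unperturbed case.

```python
import math
import random


def key_sequence(seed, length, vocab_size):
    """Expand the watermark key (assumed here to be an integer seed)
    into a length x vocab_size table of uniform random numbers."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(vocab_size)] for _ in range(length)]


def exp_min_sample(probs, xi_t):
    """Return argmin_i -log(xi_i) / p_i. When xi_t is uniform, the
    returned index is distributed according to probs."""
    best, best_score = None, math.inf
    for i, (p, u) in enumerate(zip(probs, xi_t)):
        if p <= 0.0:
            continue  # zero-probability tokens are never chosen
        score = -math.log(u) / p if u > 0.0 else math.inf
        if best is None or score < best_score:
            best, best_score = i, score
    return best


def toy_next_token_probs(prefix, vocab_size):
    """Hypothetical stand-in for a language model: a fixed pseudo-random
    distribution that depends only on the previous token."""
    rng = random.Random(prefix[-1] if prefix else vocab_size)
    weights = [rng.expovariate(1.0) for _ in range(vocab_size)]
    total = sum(weights)
    return [w / total for w in weights]


def generate(seed, length, vocab_size):
    """Generate watermarked tokens by mapping the key sequence to samples."""
    xi = key_sequence(seed, length, vocab_size)
    tokens = []
    for t in range(length):
        probs = toy_next_token_probs(tokens, vocab_size)
        tokens.append(exp_min_sample(probs, xi[t]))
    return tokens


def score(tokens, xi):
    """Sum of -log(1 - xi[t][token]); larger when tokens follow the key."""
    return sum(-math.log(1.0 - xi[t][tok]) for t, tok in enumerate(tokens))


def detect_p_value(tokens, seed, vocab_size, n_null=99, null_seed=0):
    """Monte Carlo p-value: how often a freshly drawn key scores at least
    as high as the true key. No alignment, so not robust to edits."""
    observed = score(tokens, key_sequence(seed, len(tokens), vocab_size))
    rng = random.Random(null_seed)
    hits = 0
    for _ in range(n_null):
        fresh = key_sequence(rng.getrandbits(64), len(tokens), vocab_size)
        if score(tokens, fresh) >= observed:
            hits += 1
    return (1 + hits) / (1 + n_null)


if __name__ == "__main__":
    text = generate(seed=42, length=35, vocab_size=50)
    print("p-value with the correct key:", detect_p_value(text, 42, 50))
```

Because each token is a deterministic function of the key sequence and the model's next-token distribution, anyone holding the key can recompute the sequence and test the text, while a party without the key sees samples that follow the model's own distribution. With `n_null` resampled keys the smallest reportable p-value is `1 / (n_null + 1)`, so a stricter threshold needs more resamples.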