Potential harms of large language models can be mitigated by watermarking model output, i.e., embedding signals into generated text that are invisible to humans but algorithmically detectable from a short span of tokens. We propose a watermarking framework for proprietary language models. The watermark can be embedded with negligible impact on text quality, and can be detected using an efficient open-source algorithm without access to the language model API or parameters. The watermark works by selecting a randomized set of "green" tokens before a word is generated, and then softly promoting use of green tokens during sampling. We propose a statistical test for detecting the watermark with interpretable p-values, and derive an information-theoretic framework for analyzing the sensitivity of the watermark. We test the watermark using a multi-billion-parameter model from the Open Pretrained Transformer (OPT) family, and discuss robustness and security.
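The green-token mechanism and the detection test described above can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not stated in the abstract: seeding the green set from a hash of the previous token and a secret key, the green fraction `GAMMA`, the logit bonus `DELTA`, the toy vocabulary size, the random logits standing in for a language model, and the use of a one-sided z-test to obtain the p-value.

```python
import hashlib
import math
import random

VOCAB_SIZE = 1000    # hypothetical toy vocabulary
GAMMA = 0.5          # assumed fraction of the vocabulary marked green
DELTA = 4.0          # assumed logit bonus that softly promotes green tokens
KEY = b"demo-key"    # hypothetical secret key shared by embedder and detector


def green_set(prev_token):
    """Pseudo-randomly select the green tokens, seeded by the key and previous token."""
    digest = hashlib.sha256(KEY + prev_token.to_bytes(4, "big")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return set(rng.sample(range(VOCAB_SIZE), int(GAMMA * VOCAB_SIZE)))


def sample_token(logits, prev_token, rng, delta):
    """Add delta to green-token logits, then sample from the resulting softmax."""
    green = green_set(prev_token)
    boosted = [x + delta if i in green else x for i, x in enumerate(logits)]
    top = max(boosted)
    weights = [math.exp(x - top) for x in boosted]  # unnormalized softmax
    return rng.choices(range(VOCAB_SIZE), weights=weights, k=1)[0]


def generate(length, seed, delta):
    """Generate tokens from random logits (a stand-in for a real model)."""
    rng = random.Random(seed)
    tokens = [0]
    for _ in range(length):
        logits = [rng.gauss(0.0, 1.0) for _ in range(VOCAB_SIZE)]
        tokens.append(sample_token(logits, tokens[-1], rng, delta))
    return tokens


def detect(tokens):
    """Return (z, p) for the null hypothesis that the text ignores the green sets."""
    n = len(tokens) - 1  # number of tokens that have a predecessor
    hits = sum(tokens[i] in green_set(tokens[i - 1]) for i in range(1, len(tokens)))
    z = (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1.0 - GAMMA))
    p = 0.5 * math.erfc(z / math.sqrt(2.0))  # one-sided normal tail probability
    return z, p


if __name__ == "__main__":
    print("watermarked:  ", detect(generate(200, seed=1, delta=DELTA)))
    print("unwatermarked:", detect(generate(200, seed=1, delta=0.0)))
```

Note that `detect` needs only the token sequence and the key, not the model, which mirrors the claim that detection works without access to the model API or parameters. Setting `delta=0.0` disables the promotion step and yields unwatermarked text for comparison.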