Watermarking has recently emerged as an effective strategy for detecting the outputs of large language models (LLMs). Most existing schemes require \emph{white-box} access to the model's next-token probability distribution, which is typically unavailable to downstream users of an LLM API. In this work, we propose a principled watermarking scheme that requires only the ability to sample sequences from the LLM (i.e., \emph{black-box} access), enjoys a \emph{distortion-free} property, and can be chained or nested using multiple secret keys. We provide performance guarantees, demonstrate how the scheme can be leveraged when white-box access is available, and show through comprehensive experiments when it can outperform existing white-box schemes.
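To make the black-box setting concrete, the following is a minimal, hypothetical sketch (not the paper's actual scheme): with only sampling access, a generator can draw several candidate sequences and keep the one whose keyed hash score is highest, while a detector holding the same secret key flags texts with improbably high scores. All names here (`keyed_score`, `watermark_generate`, `detect`) are illustrative assumptions, and the toy "model" is just a random token sampler.

```python
import hashlib
import random


def keyed_score(text: str, key: str) -> float:
    """Map (text, key) to a pseudorandom score in [0, 1) via a hash."""
    digest = hashlib.sha256((key + text).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def sample_sequence(rng: random.Random, length: int = 8) -> str:
    """Stand-in for black-box LLM sampling: random token sequences."""
    vocab = ["alpha", "beta", "gamma", "delta", "epsilon"]
    return " ".join(rng.choice(vocab) for _ in range(length))


def watermark_generate(rng: random.Random, key: str,
                       num_candidates: int = 16) -> str:
    """Sample candidates from the model, keep the highest keyed score."""
    candidates = [sample_sequence(rng) for _ in range(num_candidates)]
    return max(candidates, key=lambda t: keyed_score(t, key))


def detect(text: str, key: str, threshold: float = 0.9) -> bool:
    """Flag texts whose keyed score is improbably high for unmarked text."""
    return keyed_score(text, key) >= threshold


rng = random.Random(0)
watermarked = watermark_generate(rng, "secret-key")
```

Note that this naive best-of-n selection biases outputs toward high-scoring sequences, i.e., it distorts the model's output distribution; the paper's distortion-free construction is designed precisely to avoid that kind of bias.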