Watermarking has recently emerged as an effective strategy for detecting the outputs of large language models (LLMs). Most existing schemes require white-box access to the model's next-token probability distribution, which is typically not accessible to downstream users of an LLM API. In this work, we propose a principled watermarking scheme that requires only the ability to sample sequences from the LLM (i.e. black-box access), boasts a distortion-free property, and can be chained or nested using multiple secret keys. We provide performance guarantees, demonstrate how it can be leveraged when white-box access is available, and show when it can outperform existing white-box schemes via comprehensive experiments.
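The abstract does not specify the authors' construction, but a generic illustration of what sampling-only (black-box) watermarking can look like is the "sample k candidates, keep the keyed-score argmax" idea: because a keyed hash score is (near-)uniform and independent of the text's content, selecting the argmax picks an approximately uniformly random candidate, leaving the output distribution essentially unchanged — a toy version of a distortion-free property. All names below (`keyed_score`, `watermark_sample`, `detect`) are hypothetical and this is a sketch, not the paper's scheme:

```python
import hashlib
import hmac


def keyed_score(key: bytes, text: str) -> float:
    # Map (key, text) to a pseudorandom score in [0, 1) via a keyed hash.
    digest = hmac.new(key, text.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def watermark_sample(sample_fn, key: bytes, k: int = 16) -> str:
    # Black-box access only: draw k i.i.d. sequences from the LLM sampler
    # and keep the one with the highest keyed score. Since scores are
    # pseudorandom and content-independent, the winner is (approximately)
    # a uniform pick among i.i.d. samples, i.e. distribution-preserving.
    candidates = [sample_fn() for _ in range(k)]
    return max(candidates, key=lambda t: keyed_score(key, t))


def detect(key: bytes, text: str, threshold: float = 0.9) -> bool:
    # A key holder recomputes the score: watermarked text scores like the
    # max of k uniforms (expected k/(k+1)), unmarked text like a single one.
    return keyed_score(key, text) >= threshold
```

A detector without the secret key sees only ordinary samples, and distinct keys give independent scores, which is what makes chaining or nesting under multiple keys conceivable in this style of scheme.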