Recent advancements in large language models (LLMs) have highlighted the risk of misuse, raising concerns about accurately detecting LLM-generated content. A viable solution to the detection problem is to inject imperceptible identifiers, known as watermarks, into LLM outputs. Previous work demonstrates that unbiased watermarks ensure unforgeability and preserve text quality by maintaining the expectation of the LLM output probability distribution. However, previous unbiased watermarking methods are impractical for local deployment because they rely on access to white-box LLMs and input prompts during detection. Moreover, these methods fail to provide statistical guarantees for the type II error of watermark detection. This study proposes the Sampling One Then Accepting (STA-1) method, an unbiased watermark that requires access to neither the LLM nor the prompt during detection and has statistical guarantees for the type II error. We also identify a novel tradeoff between watermark strength and text quality in unbiased watermarks: in low-entropy scenarios, unbiased watermarks face a tradeoff between watermark strength and the risk of unsatisfactory outputs. Experimental results on low-entropy and high-entropy datasets demonstrate that STA-1 achieves text quality and watermark strength comparable to existing unbiased watermarks, with a low risk of unsatisfactory outputs. Implementation code for this study is available online.