Recently, text watermarking algorithms for large language models (LLMs) have been proposed to mitigate the potential harms of LLM-generated text, including fake news and copyright issues. However, current watermark detection algorithms require the secret key used during watermark generation, which leaves them vulnerable to security breaches and counterfeiting when detection is made public. To address this limitation, we propose an unforgeable, publicly verifiable watermark algorithm that uses two different neural networks for watermark generation and detection, rather than the same key at both stages. The token embedding parameters are shared between the generation and detection networks, which allows the detection network to reach high accuracy efficiently. Experiments demonstrate that our algorithm attains high detection accuracy and computational efficiency using neural networks with a minimal number of parameters. Subsequent analysis confirms the high complexity of forging the watermark from the detection network. Our code and data are available at \href{https://github.com/THU-BPM/unforgeable_watermark}{https://github.com/THU-BPM/unforgeable\_watermark}.
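To make the parameter sharing described above concrete, the following is a minimal structural sketch, not the released implementation. It assumes a toy vocabulary, random untrained weights, a single linear layer per network, and mean pooling in the detector; the class names, sizes, and layer choices (`WatermarkGenerator`, `WatermarkDetector`, `VOCAB_SIZE`, `EMBED_DIM`, `WINDOW`) are all hypothetical. The only property it is meant to illustrate is that the two networks keep separate weights while reading from one shared token embedding table.

```python
import math
import random

random.seed(0)

VOCAB_SIZE = 16  # hypothetical toy vocabulary size
EMBED_DIM = 4    # hypothetical embedding width
WINDOW = 3       # hypothetical local window seen by the generator

# One embedding table, referenced by BOTH networks (the shared parameters).
shared_embedding = [
    [random.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]


def linear(weights, features):
    """Dot product of a weight vector and a feature vector."""
    return sum(w * x for w, x in zip(weights, features))


class WatermarkGenerator:
    """Private network (assumed form): labels a token window 1 or 0."""

    def __init__(self, embedding):
        self.embedding = embedding  # shared, not copied
        self.weights = [random.gauss(0.0, 1.0) for _ in range(EMBED_DIM * WINDOW)]

    def label(self, window_tokens):
        # Concatenate the embeddings of the tokens in the window.
        features = [x for tok in window_tokens for x in self.embedding[tok]]
        return 1 if linear(self.weights, features) > 0 else 0


class WatermarkDetector:
    """Public network (assumed form): scores a whole token sequence."""

    def __init__(self, embedding):
        self.embedding = embedding  # same table as the generator
        self.weights = [random.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]

    def score(self, tokens):
        # Mean-pool the embeddings, then apply a sigmoid to a linear readout.
        pooled = [
            sum(self.embedding[tok][d] for tok in tokens) / len(tokens)
            for d in range(EMBED_DIM)
        ]
        return 1.0 / (1.0 + math.exp(-linear(self.weights, pooled)))


generator = WatermarkGenerator(shared_embedding)
detector = WatermarkDetector(shared_embedding)
```

Because the detector holds its own weights and only the embedding table in common with the generator, publishing the detector in this sketch does not directly expose the generator's decision weights. With random weights the detector's score carries no meaning; in practice it would have to be trained on watermarked and unwatermarked text.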