Text watermarking algorithms for large language models (LLMs) have recently been proposed to mitigate the potential harms of LLM-generated text, including fake news and copyright issues. However, current watermark detection algorithms require the secret key used in watermark generation, which makes them susceptible to security breaches and counterfeiting during public detection. To address this limitation, we propose an unforgeable, publicly verifiable watermark algorithm that uses two different neural networks for watermark generation and detection instead of the same key at both stages. The token embedding parameters are shared between the generation and detection networks, which allows the detection network to reach high accuracy efficiently. Experiments demonstrate that our algorithm attains high detection accuracy and computational efficiency with neural networks that have a minimal number of parameters. Subsequent analysis confirms the high complexity of forging the watermark from the detection network. Our code and data are available at \href{https://github.com/THU-BPM/unforgeable_watermark}{https://github.com/THU-BPM/unforgeable\_watermark}.
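The split between a private generation network and a public detection network that share only token embeddings can be illustrated with a minimal sketch. This is not the released implementation: the class names, the window size, the random untrained weights, and the mean-pooling detector are all assumptions chosen for brevity, and training of the detector is omitted entirely.

```python
import math
import random


class SharedEmbedding:
    """Token embedding table referenced by both networks (hypothetical layout)."""

    def __init__(self, vocab_size, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.table = [[rng.uniform(-1, 1) for _ in range(dim)]
                      for _ in range(vocab_size)]

    def __call__(self, token_id):
        return self.table[token_id]


class WatermarkGenerator:
    """Private network: decides whether the last token of a window is 'green'.

    Assumption: a single linear layer over the concatenated window embeddings.
    """

    def __init__(self, embedding, window, seed=1):
        rng = random.Random(seed)
        self.embedding = embedding
        self.window = window
        self.weights = [rng.uniform(-1, 1) for _ in range(window * embedding.dim)]

    def is_green(self, token_window):
        if len(token_window) != self.window:
            raise ValueError("token_window must contain exactly `window` tokens")
        features = [v for t in token_window for v in self.embedding(t)]
        return sum(w * v for w, v in zip(self.weights, features)) > 0


class WatermarkDetector:
    """Public network: scores a whole sequence without the generator's weights.

    Assumption: mean-pooled embeddings followed by a logistic unit, standing in
    for whatever sequence model a real detector would train.
    """

    def __init__(self, embedding, seed=2):
        rng = random.Random(seed)
        self.embedding = embedding  # same object as the generator's embedding
        self.weights = [rng.uniform(-1, 1) for _ in range(embedding.dim)]

    def score(self, token_ids):
        n = len(token_ids)
        pooled = [sum(self.embedding(t)[i] for t in token_ids) / n
                  for i in range(self.embedding.dim)]
        z = sum(w * v for w, v in zip(self.weights, pooled))
        return 1.0 / (1.0 + math.exp(-z))


# Both networks hold a reference to one embedding table; all other weights differ.
emb = SharedEmbedding(vocab_size=50, dim=8)
gen = WatermarkGenerator(emb, window=3)
det = WatermarkDetector(emb)
green = gen.is_green([4, 17, 23])
score = det.score([4, 17, 23, 9, 31])
```

In this sketch only `emb` is common to the two networks, so publishing `det` exposes the embedding table but not the generator's own weights. With untrained weights, `score` carries no meaning; it only shows the data flow.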