Watermarking has emerged as a promising solution for tracing and authenticating text generated by large language models (LLMs). A common approach to LLM watermarking is to construct a green/red token list and assign higher or lower generation probabilities to the corresponding tokens, respectively. However, most existing watermarking algorithms rely on heuristic green/red token list designs, as directly optimizing the list design with techniques such as reinforcement learning (RL) comes with several challenges. First, desirable watermarking involves multiple criteria, i.e., detectability, text quality, robustness against removal attacks, and security against spoofing attacks. Directly optimizing for these criteria introduces many partially conflicting reward terms, leading to an unstable convergence process. Second, the vast action space of green/red token list choices is susceptible to reward hacking. In this paper, we propose an end-to-end RL framework for robust and secure LLM watermarking. Our approach adopts an anchoring mechanism for reward terms to ensure stable training and introduces additional regularization terms to prevent reward hacking. Experiments on standard benchmarks with two backbone LLMs show that our method achieves a state-of-the-art trade-off across all criteria, with notable improvements in resistance to spoofing attacks without degrading other criteria. Our code is available at https://github.com/UCSB-NLP-Chang/RL-watermark.
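As background, the green/red token-list mechanism the abstract refers to can be sketched as follows. This is a minimal illustration, not the paper's learned design: it assumes the common heuristic setup in which the previous token seeds a pseudorandom partition of the vocabulary, and a bias `delta` is added to green-token logits before sampling. All function names and parameters here are illustrative.

```python
import random

def green_list(prev_token_id: int, vocab_size: int,
               gamma: float = 0.5, key: int = 42) -> set:
    """Seed a PRNG on a secret key and the previous token, then
    draw a fraction gamma of the vocabulary as the 'green' list.
    (Illustrative heuristic partition, not the paper's RL-learned one.)"""
    rng = random.Random((key, prev_token_id).__hash__())
    n_green = int(gamma * vocab_size)
    return set(rng.sample(range(vocab_size), n_green))

def watermark_logits(logits: list, prev_token_id: int,
                     delta: float = 2.0, gamma: float = 0.5,
                     key: int = 42) -> list:
    """Add bias delta to green-token logits; red-token logits unchanged.
    Sampling from the biased logits over-represents green tokens,
    which a detector holding the key can later test for."""
    green = green_list(prev_token_id, len(logits), gamma, key)
    return [l + delta if i in green else l for i, l in enumerate(logits)]
```

A detector with the same key recomputes the green list per position and runs a one-sided test on the green-token rate; the paper's contribution is to replace the heuristic partition above with an RL-optimized one balancing detectability, quality, robustness, and security.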