Advances in generative models have made it possible for AI-generated text, code, and images to mirror human-generated content in many applications. Watermarking, a technique that aims to embed information in the output of a model to verify its source, is useful for mitigating the misuse of such AI-generated content. However, we show that common design choices in LLM watermarking schemes make the resulting systems surprisingly susceptible to attack -- leading to fundamental trade-offs in robustness, utility, and usability. To navigate these trade-offs, we rigorously study a set of simple yet effective attacks on common watermarking systems, and propose guidelines and defenses for LLM watermarking in practice.