Advances in generative models have made it possible for AI-generated text, code, and images to mirror human-generated content in many applications. Watermarking, a technique that aims to embed information in the output of a model to verify its source, is useful for mitigating misuse of such AI-generated content. However, existing watermarking schemes remain surprisingly susceptible to attack. In particular, we show that desirable properties shared by existing LLM watermarking systems, such as quality preservation, robustness, and public detection APIs, can in turn make these systems vulnerable to various attacks. We rigorously study potential attacks in terms of common watermark design choices, and propose best practices and defenses for mitigation, establishing a set of practical guidelines for embedding and detection of LLM watermarks.