Amidst rising concerns about the internet being flooded with content generated by language models (LMs), watermarking is seen as a principled way to certify whether text was generated by a model. Many recent watermarking techniques slightly modify the output probabilities of an LM to embed a signal in the generated text that can later be detected. Ever since the earliest proposals for text watermarking, their robustness to paraphrasing has been a prominent concern. Recently, some schemes have been deliberately designed, and claimed, to be robust to paraphrasing. However, such watermarking schemes do not adequately account for how easily they can be reverse-engineered. We show that, with access to only a limited number of generations from a black-box watermarked model, an attacker can drastically increase the effectiveness of paraphrasing attacks at evading watermark detection, thereby rendering the watermark ineffective.