Text watermarking aims to tag and identify content produced by large language models (LLMs) in order to prevent misuse. In this study, we introduce the concept of cross-lingual consistency in text watermarking, which assesses how well a text watermark retains its effectiveness after the watermarked text is translated into other languages. Preliminary empirical results from two LLMs and three watermarking methods show that current text watermarking technologies lack consistency when text is translated into other languages. Based on this observation, we propose the Cross-lingual Watermark Removal Attack (CWRA), which bypasses watermarking by first obtaining a response from an LLM in a pivot language and then translating that response into the target language. CWRA effectively removes watermarks, reducing the AUC to a random-guessing level without loss of performance. Furthermore, we analyze two key factors that contribute to cross-lingual consistency in text watermarking and propose X-SIR as a defense against CWRA. Code: https://github.com/zwhe99/X-SIR.
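The two-step CWRA pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `cwra`, `generate`, and `translate` are hypothetical names, and translating the prompt into the pivot language before querying the LLM is an assumption made here so that the model answers in that language. The `auc` helper shows what "random-guessing level" means: when detector scores for watermarked and non-watermarked text are indistinguishable, the AUC is 0.5.

```python
from typing import Callable, Sequence

# Hypothetical interfaces (assumptions for illustration, not the paper's API):
#   generate(prompt)          -> response text from a watermarked LLM
#   translate(text, src, tgt) -> text translated from language src to language tgt
Generate = Callable[[str], str]
Translate = Callable[[str, str, str], str]


def cwra(prompt: str, generate: Generate, translate: Translate,
         pivot_lang: str, target_lang: str) -> str:
    """Sketch of CWRA: query the LLM in a pivot language, then translate the answer."""
    # Assumption: the prompt is first rendered in the pivot language
    # so that the watermarked LLM responds in that language.
    pivot_prompt = translate(prompt, target_lang, pivot_lang)
    # The response is produced (and watermarked) in the pivot language.
    pivot_response = generate(pivot_prompt)
    # Translating into the target language yields the text the attacker keeps.
    return translate(pivot_response, pivot_lang, target_lang)


def auc(watermarked_scores: Sequence[float], clean_scores: Sequence[float]) -> float:
    """Pairwise AUC: probability that a watermarked text outscores a clean one.

    Ties count as half a win, so identical score distributions give 0.5.
    """
    wins = 0.0
    for w in watermarked_scores:
        for c in clean_scores:
            if w > c:
                wins += 1.0
            elif w == c:
                wins += 0.5
    return wins / (len(watermarked_scores) * len(clean_scores))
```

In practice, `generate` would wrap a watermarked LLM and `translate` a machine translation system. The detector would then be evaluated on the translated output with `auc` or an equivalent library routine.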