Text watermarking aims to tag and identify content produced by large language models (LLMs) in order to prevent misuse. In this study, we introduce the concept of "cross-lingual consistency" in text watermarking, which assesses whether a text watermark remains effective after the watermarked text is translated into other languages. Preliminary empirical results from two LLMs and three watermarking methods reveal that current text watermarking technologies lack consistency when texts are translated into various languages. Based on this observation, we propose a Cross-lingual Watermark Removal Attack (CWRA), which bypasses watermarking by first obtaining a response from an LLM in a pivot language and then translating that response into the target language. CWRA effectively removes watermarks, reducing the Area Under the Curve (AUC) from 0.95 to 0.67 without performance loss. Furthermore, we analyze two key factors that contribute to cross-lingual consistency in text watermarking and propose a defense method that increases the AUC from 0.67 to 0.88 under CWRA.
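The two-step CWRA pipeline described in the abstract (query the LLM in a pivot language, then translate the response into the target language) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `cwra`, `generate`, and `translate` are hypothetical, the two callables stand in for a watermarked LLM and a machine translation system, and translating the prompt into the pivot language before querying is an assumption about how the pivot-language response is obtained.

```python
from typing import Callable

# Hypothetical signatures (not taken from the paper):
#   Translate: (text, source_lang, target_lang) -> translated text
#   Generate:  (prompt) -> response from a watermarked LLM
Translate = Callable[[str, str, str], str]
Generate = Callable[[str], str]


def cwra(prompt: str, generate: Generate, translate: Translate,
         target_lang: str, pivot_lang: str) -> str:
    """Obtain an LLM response in the pivot language, then translate it."""
    # Assumption: the prompt is first moved into the pivot language so that
    # the watermarked LLM answers in that language.
    pivot_prompt = translate(prompt, target_lang, pivot_lang)
    # The watermark is embedded in this pivot-language response.
    pivot_response = generate(pivot_prompt)
    # Translating into the target language is the step that weakens the watermark.
    return translate(pivot_response, pivot_lang, target_lang)


# Toy stand-ins so the sketch runs without any model or translation service.
def toy_translate(text: str, src: str, tgt: str) -> str:
    return f"[{src}->{tgt}] {text}"


def toy_generate(prompt: str) -> str:
    return f"response to: {prompt}"


result = cwra("hello", toy_generate, toy_translate,
              target_lang="en", pivot_lang="zh")
print(result)
```

With real components, `generate` would wrap the watermarked model and `translate` a translation system; the toy functions here only trace the order of the calls.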