LLM watermarks stand out as a promising way to attribute ownership of LLM-generated text. One threat to watermark credibility comes from spoofing attacks, where an unauthorized third party forges the watermark, enabling them to falsely attribute arbitrary texts to a particular LLM. While recent works have demonstrated that state-of-the-art schemes are in fact vulnerable to spoofing, they lack a deeper qualitative analysis of the texts produced by spoofing methods. In this work, we reveal for the first time that there are observable differences between genuine and spoofed watermarked texts. Namely, we show that regardless of their underlying approach, all current spoofing methods consistently leave observable artifacts in spoofed texts, indicative of watermark forgery. We build on these findings to propose rigorous statistical tests that reliably reveal the presence of such artifacts, effectively detecting that a watermark was spoofed. Our experimental evaluation shows high test power across all current spoofing methods, providing insights into their fundamental limitations and suggesting a way to mitigate this threat.
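The abstract does not specify the form of the proposed statistical tests, but the underlying idea of testing for distributional artifacts can be sketched in general terms. The snippet below is a hypothetical illustration, not the paper's actual method: assuming a green-list-style watermark where each token is flagged as "green" or not, it applies a standard two-proportion z-test to check whether a suspect text's green-token rate deviates from that of a reference genuine text. All function names and the green-list framing are assumptions made for illustration.

```python
import math

def green_fraction(flags):
    """Fraction of tokens falling in the watermark's 'green' list.
    `flags` is a list of 0/1 indicators, one per token (illustrative input)."""
    return sum(flags) / len(flags)

def two_proportion_z(flags_a, flags_b):
    """Standard two-proportion z-test: do two texts exhibit the same
    green-token rate? Returns (z statistic, two-sided p-value).
    A small p-value indicates a distributional artifact distinguishing
    the two texts."""
    p1, n1 = green_fraction(flags_a), len(flags_a)
    p2, n2 = green_fraction(flags_b), len(flags_b)
    # Pooled proportion under the null hypothesis of equal rates.
    p = (sum(flags_a) + sum(flags_b)) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the normal approximation.
    pval = math.erfc(abs(z) / math.sqrt(2))
    return z, pval

# Toy usage: a genuine text with a 90% green rate vs. a suspect text
# whose rate is noticeably lower (synthetic data, for illustration only).
genuine = [1] * 90 + [0] * 10
suspect = [1] * 70 + [0] * 30
z, p = two_proportion_z(genuine, suspect)
```

This is only one of many possible test statistics; the paper's actual tests target artifacts specific to each spoofing method, which a single rate comparison would not capture.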