LLM watermarking has attracted attention as a promising way to detect AI-generated content, with some works suggesting that current schemes may already be fit for deployment. In this work we dispute this claim, identifying watermark stealing (WS) as a fundamental vulnerability of these schemes. We show that querying the API of the watermarked LLM to approximately reverse-engineer a watermark not only enables practical spoofing attacks, as suggested in prior work, but also greatly boosts scrubbing attacks, an effect that was previously unnoticed. We are the first to propose an automated WS algorithm and use it in the first comprehensive study of spoofing and scrubbing in realistic settings. We show that for under $50 an attacker can both spoof and scrub state-of-the-art schemes previously considered safe, with an average success rate of over 80%. Our findings challenge common beliefs about LLM watermarking, stressing the need for more robust schemes. We make all our code and additional examples available at https://watermark-stealing.org.