Watermarking offers a promising solution for detecting LLM-generated content, yet its robustness under realistic query-free (black-box) evasion remains an open challenge. Existing query-free attacks often achieve limited success or severely distort semantic meaning. We bridge this gap by theoretically analyzing rewriting-based evasion, demonstrating that reducing the average conditional probability of sampling green tokens by a small margin causes the detection probability to decay exponentially. Guided by this insight, we propose the Bias-Inversion Rewriting Attack (BIRA), a practical query-free method that applies a negative logit bias to a proxy suppression set identified via token surprisal. Empirically, BIRA achieves state-of-the-art evasion rates (>99%) across diverse watermarking schemes while preserving semantic fidelity substantially better than prior baselines. Our findings reveal a fundamental vulnerability in current watermarking methods and highlight the need for rigorous stress tests.
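The two ideas in the abstract can be illustrated with a small, self-contained sketch. It is not the paper's implementation, and every name and parameter value in it is hypothetical. It assumes a green-list detector that applies a one-sided z-test to the green-token count, with green fraction `gamma` and threshold `z`, and it treats tokens as independent with green probability `p`, so the count is binomial. The bound is the standard Hoeffding inequality rather than the paper's own theorem. The suppression-set helpers assume that tokens with high surprisal under a proxy model are the ones to penalize; the abstract does not specify the actual selection rule.

```python
import math

def detection_threshold(T, gamma, z):
    """Smallest green-token count whose z-score reaches z (assumed z-test detector)."""
    return math.ceil(gamma * T + z * math.sqrt(gamma * (1 - gamma) * T))

def detection_prob(T, p, gamma=0.5, z=4.0):
    """Exact P(detect) when each of T tokens is green independently with probability p."""
    k = detection_threshold(T, gamma, z)
    return sum(math.comb(T, i) * p**i * (1 - p) ** (T - i) for i in range(k, T + 1))

def hoeffding_bound(T, p, gamma=0.5, z=4.0):
    """Hoeffding upper bound on P(detect): exp(-2 * T * gap^2) once p is below the threshold rate."""
    gap = detection_threshold(T, gamma, z) / T - p
    return 1.0 if gap <= 0 else math.exp(-2 * T * gap * gap)

def suppression_set(token_probs, threshold):
    """Tokens whose surprisal -log(prob) under a proxy model exceeds threshold (assumed rule)."""
    return {tok for tok, prob in token_probs if -math.log(prob) > threshold}

def apply_negative_bias(logits, suppressed, delta):
    """Subtract delta from the logit of every suppressed token; leave the others unchanged."""
    return {tok: (v - delta if tok in suppressed else v) for tok, v in logits.items()}

if __name__ == "__main__":
    # Lowering the green-token probability p shrinks the detection probability quickly.
    for p in (0.75, 0.65, 0.60, 0.55, 0.50):
        print(p, detection_prob(200, p), hoeffding_bound(200, p))
```

In a real rewriting pipeline the bias would be applied to the paraphrasing model's logits at each decoding step; the dictionary form here only keeps the example free of dependencies.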