Large language model (LLM) watermarking has emerged as a promising approach for detecting and attributing AI-generated text, yet its robustness to black-box spoofing remains insufficiently evaluated. Existing evaluation methods often require large datasets and white-box access to the watermarking algorithm's internals, which limits their practical applicability. In this paper, we study watermark resilience against spoofing from a distributional perspective. We first establish a \textit{local capacity bottleneck}, which theoretically characterizes how much probability mass can be reallocated under KL-bounded local updates while preserving semantic fidelity. Building on this, we propose RLSpoofer, a reinforcement learning-based black-box spoofing attack that requires only 100 human-watermarked paraphrase training pairs and no access to the watermarking internals or detectors. Despite this weak supervision, it enables a 4B model to achieve a 62.0\% spoof success rate with minimal semantic shift on PF-marked texts, far exceeding the 6\% achieved by baseline models trained on up to 10,000 samples. Our findings expose the fragile spoofing resistance of current LLM watermarking paradigms, provide a lightweight evaluation framework, and underscore the urgent need for more robust schemes.