With the rapid development of cloud-based services, large language models (LLMs) have become increasingly accessible through web platforms. This accessibility has also increased the risk of model abuse. LLM watermarking has emerged as an effective approach to mitigating such misuse and protecting intellectual property. Existing watermarking algorithms, however, focus primarily on defending against paraphrase attacks and overlook piggyback spoofing attacks, which can inject harmful content, compromise watermark reliability, and undermine trust in attribution. To address this limitation, we propose DualGuard, the first watermarking algorithm capable of defending against both paraphrase and spoofing attacks. DualGuard employs an adaptive dual-stream watermarking mechanism, in which two complementary watermark signals are dynamically injected based on the semantic content. This design enables DualGuard not only to detect but also to trace spoofing attacks, thereby ensuring reliable and trustworthy watermark detection. Extensive experiments across multiple datasets and language models demonstrate that DualGuard achieves excellent detectability, robustness, traceability, and text quality, advancing LLM watermarking for real-world applications.