Untrustworthy users can misuse image generators to synthesize high-quality deepfakes and engage in online spam or disinformation campaigns. Watermarking deters misuse by marking generated content with a hidden message, which can later be detected using a secret watermarking key. A core security property of watermarking is robustness, which states that an attacker can only evade detection by substantially degrading image quality. Assessing robustness requires designing an adaptive attack for the specific watermarking algorithm. A challenge when evaluating watermarking algorithms and their (adaptive) attacks is determining whether an adaptive attack is optimal, i.e., whether it is the best possible attack. We address this problem by defining an objective function and then treating adaptive attacks as an optimization problem. The core idea of our adaptive attacks is to replicate secret watermarking keys locally by creating surrogate keys that are differentiable and can be used to optimize the attack's parameters. We demonstrate for Stable Diffusion models that such an attacker can break all five surveyed watermarking methods with negligible degradation in image quality. These findings emphasize the need for more rigorous robustness testing against adaptive, learnable attackers.
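The surrogate-key idea described above can be illustrated with a deliberately simplified sketch. Everything in it is an assumption made for illustration and is not the method evaluated on Stable Diffusion: images are plain vectors, the watermark is a hypothetical linear scheme whose key is a random unit vector in a known subspace, and the attack is a per-coordinate gain `g`. The names `keygen`, `sample_images`, `embed`, `score`, and `fit_attack`, as well as all constants, are hypothetical. The attacker knows the key-generation routine but never sees the secret key; it samples surrogate keys, and because the detection score is differentiable in `g`, it can minimize an objective that trades the detection score against distortion by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 32, 4                # toy image dimension; watermark uses the first M coordinates
STRENGTH, TAU = 2.0, 1.0    # toy embedding strength and detection threshold

def keygen(n):
    """Public key-generation routine: random unit vectors in the watermark subspace."""
    k = np.zeros((n, D))
    k[:, :M] = rng.normal(size=(n, M))
    return k / np.linalg.norm(k, axis=1, keepdims=True)

def sample_images(n):
    """Toy 'images' with little natural energy in the watermark subspace."""
    x = rng.normal(size=(n, D))
    x[:, :M] *= 0.1
    return x

def embed(x, key):
    """Hypothetical additive watermark."""
    return x + STRENGTH * key

def score(x, key):
    """Differentiable detection score; an image counts as detected when score > TAU."""
    return np.sum(x * key, axis=-1)

def fit_attack(n=512, lam=0.1, lr=0.1, steps=300):
    """Optimize per-coordinate gains g using surrogate keys only."""
    keys = keygen(n)                        # surrogate keys, one per training sample
    xw = embed(sample_images(n), keys)      # locally watermarked training images
    g = np.ones(D)                          # start from the identity attack
    for _ in range(steps):
        s = score(g * xw, keys)
        # Gradient of mean(score^2): pushes the surrogate detection score to zero.
        grad_detect = 2 * np.mean(s[:, None] * keys * xw, axis=0)
        # Gradient of lam * mean(||g*xw - xw||^2): penalizes distortion.
        grad_quality = 2 * lam * (g - 1) * np.mean(xw ** 2, axis=0)
        g -= lr * (grad_detect + grad_quality)
    return g

secret_key = keygen(1)      # held by the provider; never passed to fit_attack
g = fit_attack()

# Evaluate the learned attack against the secret key.
test_xw = embed(sample_images(1000), secret_key)
attacked = g * test_xw
rate_before = np.mean(score(test_xw, secret_key) > TAU)
rate_after = np.mean(score(attacked, secret_key) > TAU)
rel_distortion = np.sum((attacked - test_xw) ** 2) / np.sum(test_xw ** 2)
print(rate_before, rate_after, rel_distortion)
```

In this toy setting the attack transfers from surrogate keys to the secret key only because both are drawn from the same key-generation routine, which is the property the sketch is meant to expose. The detection rate should drop sharply while the gains outside the watermark subspace stay at 1; the numbers say nothing about real watermarking methods.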