Generative AI models can produce extremely realistic-looking content, posing growing challenges to the authenticity of information. To address these challenges, watermarking has been leveraged to detect AI-generated content. Specifically, a watermark is embedded into AI-generated content before it is released, and content is detected as AI-generated if a similar watermark can be decoded from it. In this work, we perform a systematic study of the robustness of such watermark-based detection of AI-generated content. We focus on AI-generated images. Our work shows that an attacker can post-process a watermarked image by adding a small, human-imperceptible perturbation to it, such that the post-processed image evades detection while maintaining its visual quality. We show the effectiveness of our attack both theoretically and empirically. Moreover, to evade detection, our adversarial post-processing method adds much smaller perturbations to AI-generated images, and thus better maintains their visual quality, than existing popular post-processing methods such as JPEG compression, Gaussian blur, and brightness/contrast adjustment. Our work shows the insufficiency of existing watermark-based detection of AI-generated content, highlighting the urgent need for new methods. Our code is publicly available: https://github.com/zhengyuan-jiang/WEvade.
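To make the detection rule concrete, the following is a minimal sketch, not the released implementation. It assumes the watermark is a fixed-length bitstring and that "similar" means the fraction of matching bits reaches a threshold. The names `bitwise_accuracy` and `is_ai_generated`, the threshold `tau`, and the example bitstrings are all hypothetical and chosen only for illustration.

```python
def bitwise_accuracy(decoded, reference):
    """Fraction of positions where the decoded bits match the reference watermark."""
    if len(decoded) != len(reference):
        raise ValueError("watermarks must have the same length")
    matches = sum(d == r for d, r in zip(decoded, reference))
    return matches / len(reference)


def is_ai_generated(decoded, reference, tau=0.9):
    """Flag content as AI-generated if the decoded watermark is similar enough.

    tau is an assumed similarity threshold, not a value taken from the paper.
    """
    return bitwise_accuracy(decoded, reference) >= tau


# Hypothetical 10-bit watermark embedded by the content provider.
reference = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]

# Bits decoded from an unmodified watermarked image (one bit flipped by noise).
decoded_clean = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]

# Bits decoded after an evasion-style perturbation (about half the bits flipped).
decoded_attacked = [0, 1, 1, 0, 0, 1, 1, 0, 0, 1]

print(is_ai_generated(decoded_clean, reference))     # True
print(is_ai_generated(decoded_attacked, reference))  # False
```

Under these assumptions, evading detection amounts to perturbing the image until the decoded bits fall below the similarity threshold, which is why the size of the required perturbation matters for visual quality.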