Deepfakes are content synthesized using deep generators which, when misused, can erode trust in digital media. Synthesizing high-quality deepfakes requires access to large and complex generators that only a few entities can train and provide. The threat comes from malicious users who exploit access to a provided model to generate harmful deepfakes without risking detection. Watermarking makes deepfakes detectable by embedding an identifiable code into the generator that can later be extracted from the images it generates. We propose Pivotal Tuning Watermarking (PTW), a method for watermarking pre-trained generators that (i) is three orders of magnitude faster than watermarking from scratch and (ii) requires no training data. We improve existing watermarking methods and scale to generators $4\times$ larger than those in related work. PTW can embed longer codes than existing methods while better preserving the generator's image quality. We propose rigorous, game-based definitions for robustness and undetectability, and our study reveals that watermarking is not robust against an adaptive white-box attacker who controls the generator's parameters. We propose an adaptive attack that can successfully remove any watermark with access to only 200 non-watermarked images. Our work challenges the trustworthiness of watermarking for deepfake detection when the parameters of a generator are available. The source code to reproduce our experiments is available at https://github.com/nilslukas/gan-watermark.