Deepfakes are content synthesized by deep generators; when misused, they can erode trust in digital media. Synthesizing high-quality deepfakes requires access to large and complex generators that only a few entities can train and provide. The threat is that malicious users exploit access to the provided model to generate harmful deepfakes without risking detection. Watermarking makes deepfakes detectable by embedding an identifiable code into the generator that can later be extracted from its generated images. We propose Pivotal Tuning Watermarking (PTW), a method for watermarking pre-trained generators (i) three orders of magnitude faster than watermarking from scratch and (ii) without the need for any training data. We improve existing watermarking methods and scale to generators $4 \times$ larger than related work. PTW can embed longer codes than existing methods while better preserving the generator's image quality. We propose rigorous, game-based definitions for robustness and undetectability, and our study reveals that watermarking is not robust against an adaptive white-box attacker who controls the generator's parameters. We propose an adaptive attack that can successfully remove any watermark with access to only $200$ non-watermarked images. Our work challenges the trustworthiness of watermarking for deepfake detection when the parameters of a generator are available. Source code to reproduce our experiments is available at https://github.com/dnn-security/gan-watermark.
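To make the detection step concrete, the following is a minimal sketch of one common way to decide whether an extracted code matches the owner's code. It assumes that the owner's code and the bits extracted from an image are equal-length bit lists, and that bits extracted from a non-watermarked image behave like independent fair coin flips. Under that null hypothesis the number of matching bits follows a Binomial(n, 0.5) distribution, and the image is flagged when the tail probability falls below a threshold. The function names and the threshold `alpha` are hypothetical illustrations, not the API of the PTW repository, and this is not necessarily the exact decision rule used in the paper.

```python
from math import comb


def bit_accuracy(code, extracted):
    """Fraction of positions where the extracted bits equal the owner's code."""
    if len(code) != len(extracted):
        raise ValueError("code and extracted bits must have equal length")
    return sum(c == e for c, e in zip(code, extracted)) / len(code)


def p_value(n_bits, n_matches):
    """P[X >= n_matches] for X ~ Binomial(n_bits, 0.5).

    This is the probability that a non-watermarked image (random bits)
    matches at least n_matches bits of the code by chance.
    """
    tail = sum(comb(n_bits, k) for k in range(n_matches, n_bits + 1))
    return tail / 2 ** n_bits


def is_watermarked(code, extracted, alpha=1e-6):
    """Hypothetical decision rule: reject the null hypothesis when p < alpha."""
    if len(code) != len(extracted):
        raise ValueError("code and extracted bits must have equal length")
    matches = sum(c == e for c, e in zip(code, extracted))
    return p_value(len(code), matches) < alpha


# A 40-bit code with 36 matching bits is flagged (p is about 9.3e-8),
# while a perfectly matched 10-bit code is not (p = 1/1024).
print(is_watermarked([1] * 40, [1] * 36 + [0] * 4))
print(is_watermarked([1] * 10, [1] * 10))
```

The example illustrates why longer codes matter: with only 10 bits, even a perfect match cannot reach a strict false-positive threshold, whereas a 40-bit code tolerates several bit errors and still yields strong evidence.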