Foundational generative models should be traceable to protect their owners and to facilitate safety regulation. To achieve this, traditional approaches embed identifiers based on supervisory trigger-response signals, commonly known as backdoor watermarks. These watermarks are prone to failure when the model is fine-tuned with non-trigger data. Our experiments show that this vulnerability stems from energetic changes concentrated in only a few 'busy' layers during fine-tuning. This observation motivates a novel arbitrary-in-arbitrary-out (AIAO) strategy that makes watermarks resilient to fine-tuning-based removal. AIAO embeds trigger-response pairs at arbitrary depths of the neural network to construct watermarked subpaths, and applies Monte Carlo sampling over these subpaths to obtain stable verification results. Moreover, unlike existing methods that design backdoors for the input/output space of diffusion models, our method embeds the backdoor into the feature space of the sampled subpaths, using a mask-controlled trigger function that preserves generation performance and keeps the embedded backdoor invisible. Our empirical studies on the MS-COCO, AFHQ, LSUN, CUB-200, and DreamBooth datasets confirm the robustness of AIAO: whereas the verification rates of other trigger-based methods fall from ~90% to ~70% after fine-tuning, those of our method remain consistently above 90%.
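The Monte Carlo verification step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the callable `subpath_check(start, end)`, the uniform layer-pair sampling, and the acceptance `threshold` are all hypothetical stand-ins for the actual trigger-injection and response test on a watermarked subpath.

```python
import random


def verify_watermark(subpath_check, num_layers, num_samples=100,
                     threshold=0.9, seed=0):
    """Monte Carlo verification over randomly sampled subpaths.

    subpath_check(start, end) -> bool is a hypothetical callable that
    injects the trigger into the features at layer `start` and tests
    whether the expected response appears at layer `end`.
    Returns the observed verification rate and whether it clears the
    (assumed) acceptance threshold.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_samples):
        # Sample an arbitrary-in-arbitrary-out subpath: any pair of
        # layers with start strictly before end.
        start = rng.randrange(num_layers - 1)
        end = rng.randrange(start + 1, num_layers)
        hits += bool(subpath_check(start, end))
    rate = hits / num_samples
    return rate, rate >= threshold
```

Averaging the trigger-response test over many randomly chosen subpaths is what stabilizes verification: even if fine-tuning perturbs a few 'busy' layers, most sampled subpaths still carry the watermark signal.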