Despite the impressive synthesis quality of text-to-image (T2I) diffusion models, their black-box deployment poses significant regulatory challenges: malicious actors can fine-tune these models to generate illegal content, circumventing existing safeguards through parameter manipulation. It is therefore essential to verify the integrity of T2I diffusion models. To this end, considering the randomness in the outputs of generative models and the high cost of interacting with them, we detect model tampering via the KL divergence between feature distributions of generated images. We propose a novel prompt selection algorithm based on a learning automaton (PromptLA) for efficient and accurate verification. Evaluations on four advanced T2I models (e.g., SDXL, FLUX.1) demonstrate that our method achieves a mean AUC above 0.96 in integrity detection, exceeding baselines by more than 0.2 and showcasing strong effectiveness and generalization. In addition, our approach incurs lower verification cost and is robust against image-level post-processing. To the best of our knowledge, this is the first work addressing the integrity verification of T2I diffusion models, establishing quantifiable standards for AI copyright litigation in practice.
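As a minimal sketch of the detection signal described above, the snippet below fits Gaussians to two sets of image features and computes the closed-form KL divergence between them. The feature extractor, estimator, and regularization here are illustrative assumptions, not necessarily the paper's exact pipeline.

```python
import numpy as np

def gaussian_kl(feats_ref: np.ndarray, feats_test: np.ndarray, eps: float = 1e-6) -> float:
    """KL divergence between Gaussians fit to two (n_samples, dim) feature sets.

    feats_ref: features of images from the trusted reference model.
    feats_test: features of images from the deployed (possibly tampered) model.
    A large KL value suggests the output distribution has shifted.
    """
    k = feats_ref.shape[1]
    mu0, mu1 = feats_ref.mean(axis=0), feats_test.mean(axis=0)
    # Regularize covariances so they stay invertible for small sample sizes.
    cov0 = np.cov(feats_ref, rowvar=False) + eps * np.eye(k)
    cov1 = np.cov(feats_test, rowvar=False) + eps * np.eye(k)
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    # Closed-form KL(N0 || N1) for multivariate Gaussians.
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - k + logdet1 - logdet0)
```

In practice the features would come from a pretrained image encoder; here any real-valued feature matrix works, and identical inputs yield a divergence of zero.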
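To illustrate the learning-automaton component, the following is a generic linear reward-inaction (L_RI) update over a probability vector of candidate prompts: the prompt just queried gains probability in proportion to its reward. This is a textbook automaton rule used here as a hedged sketch, not necessarily the exact PromptLA update.

```python
import numpy as np

def la_update(p: np.ndarray, chosen: int, reward: float, lr: float = 0.1) -> np.ndarray:
    """One linear reward-inaction (L_RI) step of a learning automaton.

    p: current probability vector over candidate prompts (sums to 1).
    chosen: index of the prompt that was just queried.
    reward: observed reward in [0, 1], e.g. how discriminative the
            prompt's generated images were for tamper detection.
    """
    indicator = np.eye(len(p))[chosen]
    # Move probability mass toward the chosen prompt, scaled by the reward;
    # the update preserves the sum of p at exactly 1.
    return p + lr * reward * (indicator - p)
```

Iterating this update concentrates probability on prompts that repeatedly earn high rewards, which is what makes the verification both efficient (few queries) and accurate (discriminative prompts).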