Text-to-image synthesis has become highly popular for generating realistic and stylized images, often requiring fine-tuning generative models with domain-specific datasets for specialized tasks. However, these valuable datasets face risks of unauthorized usage and unapproved sharing, compromising the rights of their owners. In this paper, we address the issue of dataset abuse during the fine-tuning of Stable Diffusion models for text-to-image synthesis. We present a dataset watermarking framework designed to detect unauthorized usage and trace data leaks. The framework employs two key strategies across multiple watermarking schemes and is effective for large-scale dataset authorization. Extensive experiments demonstrate the framework's effectiveness, its minimal impact on the dataset (modifying only 2% of the data suffices for high detection accuracy), and its ability to trace data leaks. Our results also highlight the robustness and transferability of the framework, proving its practical applicability in detecting dataset abuse.