Diffusion Attribution Score: Evaluating Training Data Influence in Diffusion Model

As diffusion models become increasingly popular, the misuse of copyrighted and private images has emerged as a major concern. One promising way to mitigate this issue is data attribution: identifying the contribution of specific training samples to a generative model's outputs. Existing data attribution methods for diffusion models typically quantify a training sample's contribution by evaluating the change in diffusion loss when the sample is included in or excluded from training. However, we argue that using the diffusion loss directly cannot represent this contribution accurately, because of how the loss is computed. Specifically, the diffusion loss measures the divergence between the predicted and ground-truth distributions, so comparing losses yields only an indirect comparison between the predicted distributions of two models and cannot capture the differences in model behavior. To address this, we propose the Diffusion Attribution Score (DAS), an attribution score that directly compares predicted distributions to analyse the importance of training samples. We support the effectiveness of DAS with rigorous theoretical analysis. We also explore strategies to accelerate DAS computation, facilitating its application to large-scale diffusion models. Extensive experiments across various datasets and diffusion models demonstrate that DAS significantly surpasses previous methods in terms of the linear data-modelling score, establishing new state-of-the-art performance.
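The distinction between comparing losses and comparing predictions directly can be illustrated with a toy example. The sketch below is not the DAS formula from the paper; it assumes a simple squared-error diffusion loss on a two-dimensional noise target, and the function names (`loss_change`, `output_change`) and the numeric values are hypothetical, chosen only for illustration. It shows a case where two models are equally far from the ground-truth noise, so the loss-based score is zero, even though their predictions differ substantially.

```python
# Toy illustration (assumed setup, not the paper's DAS definition):
# compare a loss-based attribution signal with a direct comparison
# of the two models' predictions.

def squared_error(pred, target):
    """Squared L2 distance between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def loss_change(pred_full, pred_without, noise):
    """Change in squared-error loss when a training sample is removed."""
    return squared_error(pred_without, noise) - squared_error(pred_full, noise)

def output_change(pred_full, pred_without):
    """Direct squared distance between the two models' predictions."""
    return squared_error(pred_full, pred_without)

# Hypothetical values: ground-truth noise and the noise predicted by a
# model trained with the sample (full) and without it (without).
noise = [0.0, 0.0]
pred_full = [1.0, 0.0]
pred_without = [-1.0, 0.0]

print(loss_change(pred_full, pred_without, noise))  # 0.0
print(output_change(pred_full, pred_without))       # 4.0
```

Here both predictions have the same loss against the ground truth, so a loss-difference score reports no influence, while the direct comparison registers that removing the sample changed the model's output.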

