Identifying the training data samples that most influence a generated image is critical to understanding diffusion models, yet existing influence estimation methods are limited to small-scale or LoRA-tuned models by their computational cost; as diffusion models scale up, these methods become impractical. To address this challenge, we propose DMin (Diffusion Model influence), a scalable framework for estimating the influence of each training data sample on a given generated image. By leveraging efficient gradient compression and retrieval techniques, DMin reduces storage requirements from 339.39 TB to only 726 MB and retrieves the top-k most influential training samples in under 1 second, all while maintaining performance. Our empirical results demonstrate that DMin is both effective in identifying influential training samples and efficient in its computational and storage requirements.
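The abstract mentions gradient compression and fast top-k retrieval but does not spell out the mechanics. The following is a minimal, hypothetical sketch (not the authors' implementation) of the general idea: per-sample training gradients are compressed with a shared random projection, stored as a compact database, and at query time the generated image's gradient is compressed the same way; an inner product in the compressed space serves as an influence proxy, from which the top-k training samples are retrieved. All function names and dimensions below are illustrative assumptions.

```python
import numpy as np

def make_projection(d_full, d_comp, seed=0):
    """Random Gaussian projection matrix (Johnson-Lindenstrauss style),
    scaled so inner products are approximately preserved."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((d_full, d_comp)) / np.sqrt(d_comp)

def compress(grad, proj):
    """Project a full gradient vector into the low-dimensional space."""
    return grad @ proj

def top_k_influential(query_grad, compressed_db, proj, k=3):
    """Score every stored training sample against the query gradient and
    return the indices and scores of the k most influential samples."""
    q = compress(query_grad, proj)
    scores = compressed_db @ q            # inner-product influence proxy
    idx = np.argsort(scores)[::-1][:k]    # top-k by descending score
    return idx, scores[idx]

# Demo with synthetic per-sample gradients (dimensions are illustrative).
d_full, d_comp, n_train = 10_000, 256, 200
proj = make_projection(d_full, d_comp)
rng = np.random.default_rng(1)
train_grads = rng.standard_normal((n_train, d_full))
db = np.vstack([compress(g, proj) for g in train_grads])  # compact gradient store

# A query gradient close to training sample 42 should retrieve it first.
query = train_grads[42] + 0.01 * rng.standard_normal(d_full)
idx, sc = top_k_influential(query, db, proj, k=3)
```

The storage savings reported in the abstract come from keeping only the `d_comp`-dimensional compressed vectors rather than full model gradients; the retrieval speed comes from the scoring step reducing to a single matrix-vector product over the compressed database.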