Counterfactual explanations have emerged as a promising method for elucidating the behavior of opaque black-box models. Recently, several works leveraged pixel-space diffusion models for counterfactual generation. To handle the noisy, adversarial gradients that arise during counterfactual generation and cause unrealistic artifacts or mere adversarial perturbations, these works required either auxiliary adversarially robust models or computationally intensive guidance schemes. However, such requirements limit their applicability, e.g., in scenarios with restricted access to the model's training data. To address these limitations, we introduce Latent Diffusion Counterfactual Explanations (LDCE). LDCE harnesses the capabilities of recent class- or text-conditional foundation latent diffusion models to expedite counterfactual generation and to focus on the important, semantic parts of the data. Furthermore, we propose a novel consensus guidance mechanism that filters out noisy, adversarial gradients that are misaligned with the diffusion model's implicit classifier. We demonstrate the versatility of LDCE across a wide spectrum of models trained on diverse datasets with different learning paradigms. Finally, we showcase how LDCE can provide insights into model errors, enhancing our understanding of black-box model behavior.
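To make the idea of consensus guidance more concrete, the following is a minimal sketch of one way such a filter could look. It is not the authors' implementation. The abstract only states that gradients misaligned with the diffusion model's implicit classifier are filtered out; the patch-wise grouping, the cosine-similarity criterion, the 45-degree threshold, and the names `consensus_filter`, `clf_grad`, and `implicit_grad` are all assumptions made for illustration. Here `implicit_grad` stands for whatever direction the implicit classifier provides (for example, one derived from the difference between conditional and unconditional noise predictions), and `clf_grad` for the gradient of the explained model.

```python
import numpy as np


def consensus_filter(clf_grad, implicit_grad, patch=4, max_angle_deg=45.0, eps=1e-12):
    """Zero out patches of clf_grad that disagree in direction with implicit_grad.

    Hypothetical helper: both inputs have shape (C, H, W), with H and W
    divisible by `patch`. Returns the filtered gradient and the boolean
    keep-mask over patches, of shape (H // patch, W // patch).
    """
    c, h, w = clf_grad.shape
    if clf_grad.shape != implicit_grad.shape:
        raise ValueError("gradients must have the same shape")
    if h % patch or w % patch:
        raise ValueError("H and W must be divisible by patch")

    def to_patches(x):
        # (C, H, W) -> (H/p, W/p, C*p*p): one flat vector per spatial patch.
        x = x.reshape(c, h // patch, patch, w // patch, patch)
        return x.transpose(1, 3, 0, 2, 4).reshape(h // patch, w // patch, -1)

    a = to_patches(clf_grad)
    b = to_patches(implicit_grad)

    # Cosine similarity between the two gradient directions, per patch.
    # A patch with an all-zero gradient gets cosine 0 and is therefore dropped.
    norms = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    cos = (a * b).sum(axis=-1) / (norms + eps)

    # Keep a patch only if the angle between both gradients is small enough.
    keep = cos >= np.cos(np.deg2rad(max_angle_deg))

    # Expand the patch-level mask back to pixel resolution and apply it.
    mask = np.repeat(np.repeat(keep, patch, axis=0), patch, axis=1)
    return clf_grad * mask[None], keep
```

In a guided sampling loop, a filter of this kind would be applied to the classifier gradient at each step before it is added to the update, so that only regions where both signals agree contribute to the counterfactual.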