While deep learning has led to huge progress in complex image classification tasks like ImageNet, unexpected failure modes, e.g., those caused by spurious features, call into question how reliably these classifiers work in the wild. Furthermore, for safety-critical tasks the black-box nature of their decisions is problematic, and explanations, or at least methods that make decisions plausible, are urgently needed. In this paper, we address these problems by using a framework for guided image generation to generate images that optimize a classifier-derived objective. We analyze the behavior and decisions of image classifiers through visual counterfactual explanations (VCEs), detect systematic mistakes by analyzing images where classifiers maximally disagree, and visualize neurons to verify potential spurious features. In this way, we validate existing observations, e.g., the shape bias of adversarially robust models, uncover novel failure modes, e.g., systematic errors of zero-shot CLIP classifiers, and identify harmful spurious features. Moreover, our VCEs outperform previous work while being more versatile.