In recent years, work has gone into developing deep interpretable methods for image classification that clearly attribute a model's output to specific features of the data. One such method is the Prototypical Part Network (ProtoPNet), which attempts to classify images based on meaningful parts of the input. While this architecture produces visually interpretable classifications, it often learns to classify based on parts of the image that are not semantically meaningful. To address this problem, we propose the Reward Reweighing, Reselecting, and Retraining (R3) post-processing framework, which performs three additional corrective updates to a pretrained ProtoPNet in an offline and efficient manner. The first two steps involve learning a reward model based on collected human feedback and then aligning the prototypes with human preferences. The final step is retraining, which realigns the base features and the classifier layer of the original model with the updated prototypes. We find that our R3 framework consistently improves both the interpretability and the predictive accuracy of ProtoPNet and its variants.