In recent years, work has gone into developing deep interpretable methods for image classification that clearly attribute a model's output to specific features of the data. One such method is the prototypical part network (ProtoPNet), which attempts to classify images based on meaningful parts of the input. While this method yields interpretable classifications, it often learns to classify from spurious or inconsistent parts of the image. To remedy this, we take inspiration from recent developments in Reinforcement Learning with Human Feedback (RLHF) to fine-tune these prototypes. By collecting human annotations of prototype quality on a 1-5 scale on the CUB-200-2011 dataset, we construct a reward model that learns to identify non-spurious prototypes. In place of a full RL update, we propose the reweighted, reselected, and retrained prototypical part network (R3-ProtoPNet), which adds three steps to the ProtoPNet training loop. The first two steps, reward-based reweighting and reselection, align prototypes with human feedback. The final step, retraining, realigns the model's features with the updated prototypes. We find that R3-ProtoPNet improves the overall consistency and meaningfulness of the prototypes but lowers test predictive accuracy when used independently. When multiple R3-ProtoPNets are incorporated into an ensemble, we find an increase in test predictive performance while maintaining interpretability.
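As a rough illustration of the reselection and ensembling ideas described above, the following minimal sketch may help. It is not the authors' implementation: the function names `reselect_prototypes` and `ensemble_predict`, the `threshold` and `max_tries` parameters, the representation of prototypes as plain feature vectors, and the choice of averaging logits across models are all assumptions made for this example. The `reward` callable stands in for the learned reward model. The reweighting and retraining steps are not shown.

```python
import random
from typing import Callable, List, Sequence

Prototype = List[float]  # assumption: a prototype is a plain feature vector


def reselect_prototypes(
    prototypes: Sequence[Prototype],
    candidates: Sequence[Prototype],
    reward: Callable[[Prototype], float],
    threshold: float,
    rng: random.Random,
    max_tries: int = 50,
) -> List[Prototype]:
    """Replace low-reward prototypes with sampled higher-reward candidates.

    Hypothetical reselection rule: a prototype whose reward is below
    `threshold` is swapped for randomly sampled candidates that score
    higher, stopping once the threshold is met or `max_tries` runs out.
    """
    updated = []
    for proto in prototypes:
        best = proto
        if reward(best) < threshold:
            for _ in range(max_tries):
                cand = rng.choice(candidates)
                if reward(cand) > reward(best):
                    best = cand
                if reward(best) >= threshold:
                    break
        updated.append(list(best))
    return updated


def ensemble_predict(per_model_logits: Sequence[Sequence[float]]) -> int:
    """Average class logits across models and return the argmax class.

    Averaging logits is an assumed combination rule for this sketch.
    """
    n_models = len(per_model_logits)
    n_classes = len(per_model_logits[0])
    mean = [
        sum(model[c] for model in per_model_logits) / n_models
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=mean.__getitem__)


if __name__ == "__main__":
    toy_reward = sum  # toy stand-in for a learned reward model
    protos = [[0.1, 0.1], [0.9, 0.9]]
    pool = [[0.8, 0.7], [0.6, 0.6]]
    print(reselect_prototypes(protos, pool, toy_reward, 1.0, random.Random(0)))
    print(ensemble_predict([[1.0, 2.0, 0.5], [0.5, 1.0, 3.0]]))
```

In this toy setup, the first prototype falls below the threshold and is replaced by a candidate from the pool, while the second already meets the threshold and is kept unchanged.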