In recent years, work has gone into developing deep interpretable methods for image classification that clearly attribute a model's output to specific features of the data. One such method is the Prototypical Part Network (ProtoPNet), which attempts to classify images based on meaningful parts of the input. While this method produces interpretable classifications, it often learns to classify from spurious or inconsistent parts of the image. To remedy this, we take inspiration from recent developments in Reinforcement Learning with Human Feedback (RLHF) to fine-tune these prototypes. By collecting human annotations of prototype quality on a 1-5 scale on the CUB-200-2011 dataset, we construct a reward model that learns human preferences and identifies non-spurious prototypes. In place of a full RL update, we propose the Reweighed, Reselected, and Retrained Prototypical Part Network (R3-ProtoPNet), which adds three steps to the ProtoPNet training loop. The first two steps are reward-based reweighting and reselection, which align prototypes with human feedback. The final step is retraining, which realigns the model's features with the updated prototypes. We find that R3-ProtoPNet improves the overall meaningfulness of the prototypes and maintains or improves individual model performance. When multiple trained R3-ProtoPNets are incorporated into an ensemble, we find increases in both interpretability and predictive performance.
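To make the reselection step concrete, the following is a minimal sketch and not the method's actual implementation. It assumes that a reward model has already scored each current prototype and a set of candidate image patches on the 1-5 scale, and that a prototype is swapped only when its reward falls below a cutoff. The function name `reselect_prototypes`, the `threshold` parameter, and the "pick the best-scoring candidate" rule are all hypothetical choices made for illustration; the abstract does not specify them.

```python
from typing import List, Sequence


def reselect_prototypes(
    prototype_rewards: Sequence[float],
    candidate_rewards: Sequence[Sequence[float]],
    threshold: float,
) -> List[int]:
    """For each prototype, return the index of the candidate patch that
    should replace it, or -1 if the prototype is kept.

    All inputs are assumed to come from a hypothetical reward model that
    scores prototypes and candidate patches on the same 1-5 scale.
    """
    choices: List[int] = []
    for reward, candidates in zip(prototype_rewards, candidate_rewards):
        # Keep prototypes that already meet the (assumed) reward cutoff,
        # or that have no candidates to choose from.
        if reward >= threshold or not candidates:
            choices.append(-1)
            continue
        # Otherwise pick the highest-reward candidate patch.
        best = max(range(len(candidates)), key=lambda i: candidates[i])
        # Swap only if the candidate actually scores higher than the
        # current prototype.
        choices.append(best if candidates[best] > reward else -1)
    return choices


if __name__ == "__main__":
    decisions = reselect_prototypes(
        prototype_rewards=[4.2, 1.5, 2.0],
        candidate_rewards=[[3.0], [2.5, 4.0, 1.0], [1.0, 1.5]],
        threshold=3.0,
    )
    print(decisions)
```

In this sketch the first prototype is kept because it already clears the cutoff, the second is replaced by its best candidate, and the third is kept because no candidate scores higher. A retraining pass, as described above, would then be needed to realign the model's features with the updated prototypes.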