Can we preserve the accuracy of neural models while also faithfully attributing model decisions to training data? We propose a "wrapper box" pipeline: train a neural model as usual, then use its learned feature representation in classic, interpretable models to perform prediction. Across seven language models of varying sizes, including four large language models (LLMs), two datasets at different scales, three classic models, and four evaluation metrics, we first show that the predictive performance of the wrapper classic models is largely comparable to that of the original neural models. Because classic models are transparent, each model decision is determined by a known set of training examples that can be shown directly to users. Our pipeline thus preserves the predictive performance of neural language models while faithfully attributing classic model decisions to training data. Among other use cases, such attribution enables model decisions to be contested on the basis of the responsible training instances. Compared to prior work, our approach achieves higher coverage and correctness in identifying which training data to remove to change a model decision. To reproduce our findings, our source code is available at: https://github.com/SamSoup/WrapperBox.
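The pipeline described above can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's implementation: it assumes k-nearest neighbors as the classic wrapper model and uses random vectors as a stand-in for the features a trained neural encoder would produce. The attribution step then recovers exactly the training examples that determined the prediction.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Stand-in for learned neural features (hypothetical: the real pipeline
# would extract these from a trained language model's representations).
X_train = rng.normal(size=(100, 16))
y_train = (X_train[:, 0] > 0).astype(int)

# Fit a classic, interpretable model on the neural feature space.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

x_test = rng.normal(size=(1, 16))
pred = knn.predict(x_test)[0]

# Faithful attribution: the decision is fully determined by these
# k training examples, which can be shown directly to the user
# (or removed to contest/flip the decision).
dists, idx = knn.kneighbors(x_test)
print("prediction:", pred, "responsible training indices:", idx[0])
```

With a transparent model like kNN, removing the responsible neighbors and refitting is a direct way to test whether a decision changes, which is the coverage/correctness setting the abstract refers to.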