Can we preserve the accuracy of neural models while also providing faithful explanations? We present wrapper boxes, a general approach to generate faithful, example-based explanations for model predictions while maintaining predictive performance. After training a neural model as usual, its learned feature representation is input to a classic, interpretable model to perform the actual prediction. This simple strategy is surprisingly effective, with results largely comparable to those of the original neural model, as shown across three large pre-trained language models, two datasets of varying scale, four classic models, and four evaluation metrics. Moreover, because these classic models are interpretable by design, the subset of training examples that determine classic model predictions can be shown directly to users.
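The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a k-nearest-neighbours classifier as the classic model (the abstract does not name the four classic models), and `encode` is a hypothetical toy stand-in for the feature extractor of a trained neural model. The point is only to show how the training examples that determine a prediction can be returned alongside it.

```python
import math
from collections import Counter


def encode(text):
    """Hypothetical stand-in for a trained neural model's feature extractor.

    A real wrapper box would return the model's learned representation;
    this toy version just counts a few cue words (an assumption for the demo).
    """
    words = text.lower().split()
    return [
        float(sum(w in {"great", "good", "love"} for w in words)),
        float(sum(w in {"bad", "awful", "hate"} for w in words)),
    ]


def knn_predict(train, query, k=3):
    """Predict by majority vote over the k nearest training examples.

    Returns the predicted label together with the neighbours that cast the
    votes, so those same examples can be shown to the user as the explanation.
    """
    q = encode(query)
    # Rank training examples by Euclidean distance in feature space.
    ranked = sorted(train, key=lambda ex: math.dist(encode(ex[0]), q))
    neighbours = ranked[:k]
    votes = Counter(label for _, label in neighbours)
    label = votes.most_common(1)[0][0]
    return label, neighbours


# Toy labelled training set (illustrative data, not from the paper).
train = [
    ("a great film, I love it", "pos"),
    ("good acting and a good story", "pos"),
    ("awful plot, bad pacing", "neg"),
    ("I hate this bad movie", "neg"),
    ("great cast", "pos"),
]

label, evidence = knn_predict(train, "good cast and great music", k=3)
print(label)
for text, lab in evidence:
    print(f"  because of: {text!r} ({lab})")
```

Because the prediction is computed only from the returned neighbours, the explanation is faithful by construction in this sketch: changing those examples is the only way to change the vote.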