Feature attribution methods explain black-box machine learning (ML) models by assigning importance scores to input features. These methods can be computationally expensive for large ML models. To address this challenge, there have been increasing efforts to develop amortized explainers, where a machine learning model is trained to predict feature attribution scores in a single inference pass. Despite their efficiency, amortized explainers can produce inaccurate predictions and misleading explanations. In this paper, we propose selective explanations, a novel feature attribution method that (i) detects when amortized explainers generate low-quality explanations and (ii) improves these explanations using a technique called explanations with initial guess. Our selective explanation method allows practitioners to specify the fraction of samples that receive explanations with initial guess, offering a principled way to bridge the gap between amortized explainers and their high-quality counterparts.
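The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how low-quality explanations are detected or how the initial guess is used, so everything here is an assumption for illustration: `uncertainty` is a hypothetical per-sample score (higher means the amortized explanation is less trusted), `expensive_explainer` is a hypothetical callable standing in for a costly attribution method, and the convex combination with weight `lam` is one plausible way to refine an explanation starting from the amortized output, not necessarily the one used in the paper.

```python
import numpy as np


def select_for_refinement(uncertainty, fraction):
    """Boolean mask marking the `fraction` of samples with the highest uncertainty."""
    uncertainty = np.asarray(uncertainty, dtype=float)
    n = uncertainty.shape[0]
    k = int(round(fraction * n))  # number of samples to refine
    mask = np.zeros(n, dtype=bool)
    if k > 0:
        # argsort is ascending, so the last k indices have the highest scores
        mask[np.argsort(uncertainty, kind="stable")[n - k:]] = True
    return mask


def selective_explanations(amortized, uncertainty, fraction,
                           expensive_explainer, lam=0.5):
    """Keep amortized attributions for most samples; refine the selected ones.

    amortized:            (n_samples, n_features) attributions from the amortized explainer
    uncertainty:          (n_samples,) hypothetical quality proxy, higher = less trusted
    fraction:             share of samples that receive the refined explanation
    expensive_explainer:  hypothetical callable, indices -> (len(indices), n_features)
    lam:                  assumed weight on the amortized output (the initial guess)
    """
    amortized = np.asarray(amortized, dtype=float)
    mask = select_for_refinement(uncertainty, fraction)
    out = amortized.copy()
    idx = np.flatnonzero(mask)
    if idx.size > 0:
        # The costly method is only called for the selected samples
        refined = np.asarray(expensive_explainer(idx), dtype=float)
        # Assumed refinement rule: blend the initial guess with the costly estimate
        out[idx] = lam * amortized[idx] + (1.0 - lam) * refined
    return out, mask
```

Because the costly explainer is invoked only on the selected indices, `fraction` directly controls the trade-off between compute and explanation quality: `fraction=0` returns the amortized output unchanged, while `fraction=1` refines every sample.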