Abstaining classifiers have the option to abstain from making predictions on inputs that they are unsure about. These classifiers are becoming increasingly popular in high-stakes decision-making problems, as they can withhold uncertain predictions to improve their reliability and safety. When evaluating black-box abstaining classifiers, however, we lack a principled approach that accounts for what the classifier would have predicted on its abstentions. These missing predictions matter when, for example, a radiologist is unsure of their diagnosis or a driver is inattentive in a self-driving car. In this paper, we introduce a novel approach and perspective to the problem of evaluating and comparing abstaining classifiers by treating abstentions as missing data. Our evaluation approach centers on the counterfactual score of an abstaining classifier, defined as the expected performance of the classifier had it not been allowed to abstain. We specify the conditions under which the counterfactual score is identifiable: if the abstentions are stochastic, and if the evaluation data is independent of the training data (ensuring that the predictions are missing at random), then the score is identifiable. Conversely, if abstentions are deterministic, then the score is unidentifiable because the classifier can perform arbitrarily poorly on its abstentions. Leveraging tools from observational causal inference, we then develop nonparametric and doubly robust methods to efficiently estimate this quantity under identification. We examine our approach in both simulated and real-data experiments.
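To make the counterfactual score concrete, the following is a minimal sketch of a doubly robust (augmented inverse-probability-weighted) estimator on simulated data. It is not the authors' implementation. The function name `dr_counterfactual_score`, the discretised input strata, and all simulation parameters are assumptions introduced for illustration. The sketch assumes the two identification conditions stated above: abstentions are stochastic (every stratum has a nonzero chance of a prediction) and predictions are missing at random given the input.

```python
import numpy as np


def dr_counterfactual_score(x, abstain, score, n_strata):
    """Doubly robust (AIPW) estimate of the mean score had the classifier never abstained.

    x        : integer stratum label per example (hypothetical discretised input)
    abstain  : 1 if the classifier abstained on the example, 0 otherwise
    score    : per-example score (e.g. 0/1 correctness); ignored where abstain == 1
    """
    x = np.asarray(x)
    abstain = np.asarray(abstain)
    observed = abstain == 0
    # Masked scores (possibly NaN) are replaced so they never enter the arithmetic.
    score = np.where(observed, score, 0.0)

    # Nuisance estimates per stratum: abstention rate and mean observed score.
    pi_hat = np.zeros(n_strata)
    mu_hat = np.zeros(n_strata)
    for k in range(n_strata):
        in_k = x == k
        pi_hat[k] = abstain[in_k].mean()
        mu_hat[k] = score[in_k & observed].mean()

    # Outcome-model prediction plus an inverse-probability-weighted residual.
    # Requires pi_hat < 1 in every stratum (stochastic abstentions).
    correction = observed / (1.0 - pi_hat[x]) * (score - mu_hat[x])
    influence = mu_hat[x] + correction

    estimate = influence.mean()
    std_err = influence.std(ddof=1) / np.sqrt(len(x))
    return estimate, std_err


# Simulated evaluation set (all numbers below are made up for illustration).
rng = np.random.default_rng(0)
n = 20_000
true_pi = np.array([0.1, 0.4, 0.7])     # abstention probability per stratum
true_mu = np.array([0.95, 0.80, 0.55])  # accuracy per stratum if forced to predict

x = rng.integers(0, 3, size=n)
abstain = rng.binomial(1, true_pi[x])
correct = rng.binomial(1, true_mu[x]).astype(float)
correct[abstain == 1] = np.nan          # hidden from the evaluator

estimate, std_err = dr_counterfactual_score(x, abstain, correct, n_strata=3)
selective = np.nanmean(correct)         # accuracy on non-abstentions only
counterfactual_truth = true_mu.mean()   # strata are equally likely in this simulation

print(f"selective score:      {selective:.3f}")
print(f"counterfactual (DR):  {estimate:.3f} +/- {1.96 * std_err:.3f}")
print(f"counterfactual truth: {counterfactual_truth:.3f}")
```

In this simulation the classifier abstains more often where it is less accurate, so the accuracy computed on non-abstentions alone overstates how it would perform if forced to predict everywhere. With fully saturated strata, as here, the correction term averages to zero and the point estimate coincides with the plug-in average of the per-stratum means; the term matters for the standard error and when the nuisance functions are fit with more flexible models.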