Binary classification involves predicting the label of an instance based on whether the model score for the positive class exceeds a threshold chosen according to the application requirements (e.g., maximizing recall subject to a precision bound). However, model scores are often not aligned with the true positivity rate. This is especially true when training involves differential sampling across classes or when there is distributional drift between the training and test settings. In this paper, we provide theoretical analysis and empirical evidence that the estimation bias of the model score depends on both the uncertainty and the score itself. Further, we formulate decision boundary selection in terms of both model score and uncertainty, prove that it is NP-hard, and present algorithms based on dynamic programming and isotonic regression. Evaluation of the proposed algorithms on three real-world datasets yields a 25%-40% gain in recall at high precision bounds over the traditional approach of using the model score alone, highlighting the benefits of leveraging uncertainty.
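As a point of reference, the traditional score-only approach mentioned in the abstract (pick the threshold that maximizes recall subject to a precision bound) can be sketched as follows. This is a minimal illustration on a labeled validation set, not the uncertainty-aware algorithms proposed in the paper; the function name `best_threshold` and its parameters are hypothetical.

```python
import numpy as np

def best_threshold(scores, labels, min_precision):
    """Score-only baseline (hypothetical helper, not the paper's method).

    Predict positive when score >= threshold. Returns a tuple
    (threshold, recall, precision) with the highest recall among thresholds
    whose precision is at least min_precision, or None if none qualifies.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    if scores.size == 0 or labels.sum() == 0:
        return None  # nothing to threshold, or recall is undefined

    # Sort instances by descending score so that cutting after position k
    # means "predict positive for the top k + 1 instances".
    order = np.argsort(-scores, kind="stable")
    s, y = scores[order], labels[order]

    tp = np.cumsum(y)                        # true positives in each prefix
    predicted = np.arange(1, len(y) + 1)     # predicted positives in each prefix
    precision = tp / predicted
    recall = tp / y.sum()

    # A cut is only valid where the next score differs, so tied scores
    # always receive the same predicted label.
    is_cut = np.append(s[:-1] != s[1:], True)
    feasible = np.flatnonzero(is_cut & (precision >= min_precision))
    if feasible.size == 0:
        return None

    # Recall is non-decreasing along the sorted order; argmax returns the
    # first (highest-threshold) cut that attains the maximum recall.
    best = feasible[np.argmax(recall[feasible])]
    return float(s[best]), float(recall[best]), float(precision[best])
```

For example, `best_threshold([0.9, 0.8, 0.7, 0.6, 0.5, 0.4], [1, 1, 0, 1, 0, 0], 0.75)` selects the threshold 0.6. Because this baseline relies on the score alone, it inherits any miscalibration of the score, which is the limitation the abstract sets out to address by also using uncertainty.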