The performance of classifiers is often measured by average accuracy on test data. Despite being a standard measure, average accuracy fails to characterize the fit of the model to the underlying conditional law of the labels given the feature vector ($Y|X$), e.g., due to model misspecification, overfitting, and high dimensionality. In this paper, we consider the fundamental problem of assessing the goodness-of-fit of a general binary classifier. Our framework makes no parametric assumption on the conditional law $Y|X$ and treats it as a black-box oracle model that can be accessed only through queries. We formulate the goodness-of-fit assessment problem as a tolerance hypothesis testing problem of the form \[ H_0: \mathbb{E}\Big[D_f\Big({\sf Bern}(\eta(X))\|{\sf Bern}(\hat{\eta}(X))\Big)\Big]\leq \tau\,, \] where $D_f$ represents an $f$-divergence function, and $\eta(x)$ and $\hat{\eta}(x)$ respectively denote the true and estimated likelihoods of a feature vector $x$ admitting a positive label. We propose a novel test, called \grasp, for testing $H_0$, which works in finite-sample settings regardless of the distribution of the features (distribution-free). We also propose model-X \grasp, designed for model-X settings where the joint distribution of the feature vector is known. Model-X \grasp uses this distributional information to achieve better power. We evaluate the performance of our tests through extensive numerical experiments.
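To make the quantity inside $H_0$ concrete, the following is a minimal Python sketch of the average $f$-divergence between ${\sf Bern}(\eta(X))$ and ${\sf Bern}(\hat{\eta}(X))$ over a sample of feature vectors. It is not the \grasp test itself: it assumes, for illustration only, that the true $\eta$ is available, whereas in the setting above $\eta$ is unknown and accessible only through label queries. The function names, the two generator functions (KL and total variation), and the synthetic values of `eta` and `eta_hat` are all hypothetical choices made here.

```python
import numpy as np

def bernoulli_f_divergence(p, q, f):
    """D_f(Bern(p) || Bern(q)) = q*f(p/q) + (1-q)*f((1-p)/(1-q)), for q in (0, 1)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return q * f(p / q) + (1.0 - q) * f((1.0 - p) / (1.0 - q))

def f_kl(t):
    """Generator of the KL divergence: f(t) = t*log(t), with f(0) = 0 by continuity."""
    t = np.asarray(t, dtype=float)
    safe = np.where(t > 0, t, 1.0)  # avoids log(0); t*log(1) = 0 where t == 0
    return t * np.log(safe)

def f_tv(t):
    """Generator of the total variation distance: f(t) = |t - 1| / 2."""
    return 0.5 * np.abs(np.asarray(t, dtype=float) - 1.0)

def average_divergence(eta, eta_hat, f):
    """Sample average of D_f(Bern(eta(x_i)) || Bern(eta_hat(x_i)))."""
    return float(np.mean(bernoulli_f_divergence(eta, eta_hat, f)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical synthetic example: a "true" eta and a slightly perturbed estimate.
    eta = rng.uniform(0.1, 0.9, size=1000)
    eta_hat = np.clip(eta + rng.normal(0.0, 0.05, size=1000), 0.01, 0.99)
    tau = 0.01  # illustrative tolerance level
    avg_kl = average_divergence(eta, eta_hat, f_kl)
    print("average KL divergence:", avg_kl)
    print("at most tau:", avg_kl <= tau)
```

With the total variation generator, the per-point divergence reduces to $|\eta(x)-\hat{\eta}(x)|$, which gives a quick sanity check on the implementation. Comparing this average to $\tau$ is only a population-level illustration of the null hypothesis; a valid test must additionally account for the fact that only labels, not $\eta$, are observed.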