Understanding how different classes are distributed in an unlabeled data set is an important challenge for the calibration of probabilistic classifiers and for uncertainty quantification. Approaches like adjusted classify and count, black-box shift estimators, and invariant ratio estimators use an auxiliary (and potentially biased) black-box classifier trained on a different (shifted) data set to estimate the class distribution, and they yield asymptotic guarantees under weak assumptions. We demonstrate that all these algorithms are closely related to inference in a particular Bayesian model that approximates the assumed ground-truth generative process. Then, we discuss an efficient Markov chain Monte Carlo sampling scheme for the introduced model and show an asymptotic consistency guarantee in the large-data limit. We compare the introduced model against the established point estimators in a variety of scenarios and show that it is competitive with, and in some cases superior to, the state of the art.
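To make the point estimators mentioned above concrete, the following is a minimal sketch of the idea shared by adjusted classify and count and black-box shift estimation: estimate the classifier's confusion matrix on labeled data, then solve a linear system for the class prevalence in the unlabeled set. It illustrates the established point estimators only, not the Bayesian model introduced here. The function name `estimate_prevalence` and its arguments are hypothetical, and the sketch assumes that every class appears at least once in the labeled data.

```python
import numpy as np


def estimate_prevalence(y_labeled, pred_labeled, pred_unlabeled, n_classes):
    """Point estimate of class prevalence from hard black-box predictions.

    Hypothetical helper: assumes integer labels in 0..n_classes-1 and that
    each class occurs at least once in the labeled data.
    """
    y_labeled = np.asarray(y_labeled)
    pred_labeled = np.asarray(pred_labeled)
    pred_unlabeled = np.asarray(pred_unlabeled)

    # confusion[i, j] estimates P(prediction = i | true class = j).
    confusion = np.zeros((n_classes, n_classes))
    np.add.at(confusion, (pred_labeled, y_labeled), 1.0)
    confusion /= confusion.sum(axis=0, keepdims=True)

    # Empirical distribution of predictions on the unlabeled data.
    q = np.bincount(pred_unlabeled, minlength=n_classes) / len(pred_unlabeled)

    # Solve confusion @ pi = q in the least-squares sense, then project
    # back onto the probability simplex by clipping and renormalizing.
    pi, *_ = np.linalg.lstsq(confusion, q, rcond=None)
    pi = np.clip(pi, 0.0, None)
    return pi / pi.sum()


# Toy usage: a biased classifier with known error rates.
y_lab = [0] * 10 + [1] * 10
p_lab = [0] * 8 + [1] * 2 + [0] * 3 + [1] * 7
p_unl = [0] * 40 + [1] * 60
prevalence = estimate_prevalence(y_lab, p_lab, p_unl, n_classes=2)
```

In this toy case the raw prediction frequencies on the unlabeled set are 0.4 and 0.6, while the corrected estimate is approximately 0.2 and 0.8, because the correction accounts for the classifier's error rates. The clipping step is a simple fix for estimates that fall outside the simplex; it is one common choice, not the only one.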