Quantification is a supervised machine learning task that estimates the class prevalence of a dataset rather than labeling its individual observations. We introduce Continuous Sweep, a new parametric binary quantifier inspired by Median Sweep, currently one of the best-performing binary quantifiers. Continuous Sweep modifies Median Sweep in three ways: (1) it uses parametric class distributions instead of empirical distributions, (2) it optimizes decision boundaries instead of applying discrete decision rules, and (3) it calculates the mean instead of the median. We derive analytic expressions for the bias and variance of Continuous Sweep under general model assumptions, one of the first theoretical contributions in the field of quantification learning. These derivations also enable us to find the optimal decision boundaries. Finally, our simulation study shows that Continuous Sweep outperforms Median Sweep in a wide range of situations.
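To make the contrast between the two quantifiers concrete, the following is a minimal sketch, not the authors' implementation. It computes the adjusted-count prevalence estimate (observed positive rate minus false positive rate, divided by true positive rate minus false positive rate) at each threshold, then aggregates by the median with empirical rates (in the spirit of Median Sweep) or by the mean with parametric rates (in the spirit of Continuous Sweep). Everything specific here is an assumption made for illustration: the Gaussian class distributions, the simulated scores, the threshold grid standing in for a continuous range, the `min_gap` cutoff of 0.25, and all function names. The optimization of decision boundaries described above is not attempted.

```python
import random
from statistics import NormalDist, fmean, median, stdev


def empirical_rates(pos, neg, t):
    """True and false positive rates at threshold t from empirical score samples."""
    tpr = sum(s > t for s in pos) / len(pos)
    fpr = sum(s > t for s in neg) / len(neg)
    return tpr, fpr


def normal_rates(pos, neg, t):
    """Same rates, but from normal distributions fitted to each class (an assumption)."""
    pos_dist = NormalDist(fmean(pos), stdev(pos))
    neg_dist = NormalDist(fmean(neg), stdev(neg))
    return 1.0 - pos_dist.cdf(t), 1.0 - neg_dist.cdf(t)


def sweep_estimates(pos, neg, test, thresholds, rates, min_gap=0.25):
    """Adjusted-count estimate at every threshold where tpr - fpr exceeds min_gap."""
    estimates = []
    for t in thresholds:
        tpr, fpr = rates(pos, neg, t)
        if tpr - fpr > min_gap:  # skip thresholds with an unstable denominator
            observed = sum(s > t for s in test) / len(test)
            estimates.append((observed - fpr) / (tpr - fpr))
    return estimates


# Hypothetical simulated classifier scores: positives ~ N(1, 1), negatives ~ N(-1, 1).
rng = random.Random(0)
train_pos = [rng.gauss(1.0, 1.0) for _ in range(1000)]
train_neg = [rng.gauss(-1.0, 1.0) for _ in range(1000)]

# Unlabeled test set with a true prevalence of 0.3 (600 positives, 1400 negatives).
true_prevalence = 0.3
test = [rng.gauss(1.0, 1.0) for _ in range(600)]
test += [rng.gauss(-1.0, 1.0) for _ in range(1400)]

# A fine grid of thresholds approximates sweeping over a continuous range.
thresholds = [-3.0 + 0.1 * i for i in range(61)]

median_empirical = median(
    sweep_estimates(train_pos, train_neg, test, thresholds, empirical_rates)
)
mean_parametric = fmean(
    sweep_estimates(train_pos, train_neg, test, thresholds, normal_rates)
)

print(f"true prevalence:            {true_prevalence:.3f}")
print(f"median of empirical sweep:  {median_empirical:.3f}")
print(f"mean of parametric sweep:   {mean_parametric:.3f}")
```

Both aggregates should land near the true prevalence in this easy, well-separated setting; the sketch shows only the mechanics of the two aggregation schemes and says nothing about their relative accuracy.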