Uncertainty sampling is a prevalent active learning algorithm that sequentially queries the annotations of data samples about which the current prediction model is uncertain. However, the use of uncertainty sampling has been largely heuristic: there is no consensus on the proper definition of ``uncertainty'' for a specific task under a specific loss, nor is there a theoretical guarantee that prescribes a standard protocol for implementing the algorithm. In this work, we systematically examine uncertainty sampling algorithms for binary classification via the notion of an equivalent loss, which depends on the uncertainty measure used and on the original loss function, and we establish that an uncertainty sampling algorithm is in effect optimizing this equivalent loss. This perspective lets us assess whether existing uncertainty measures are proper from two aspects: the surrogate property and loss convexity. When convexity is preserved, we give a sample complexity result for the equivalent loss and then translate it into a guarantee for the binary loss via the surrogate link function. Using this approach, we prove that uncertainty sampling is asymptotically superior to passive learning under mild conditions. We also discuss potential extensions, including the pool-based setting and generalizations to multi-class classification and regression problems.
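For readers unfamiliar with the setting, the query loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the protocol analyzed in the paper: the stream-based setup, the logistic model, the particular uncertainty measure `1 - |2p - 1|`, the rule of querying with probability equal to the uncertainty, and all function and parameter names are hypothetical choices made for this example.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def uncertainty(w, x):
    # Assumed uncertainty measure (hypothetical choice): equals 1 when the
    # predicted probability p is 0.5 and decreases to 0 as p approaches 0 or 1.
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return 1.0 - abs(2.0 * p - 1.0)


def uncertainty_sampling(stream, oracle, dim, lr=0.1, seed=0):
    """Stream-based uncertainty sampling with SGD on the logistic loss.

    stream: iterable of feature vectors (lists of floats of length dim).
    oracle: callable mapping a feature vector to a label in {0, 1}.
    Returns the learned weights and the number of labels queried.
    """
    rng = random.Random(seed)
    w = [0.0] * dim
    n_queries = 0
    for x in stream:
        # Assumption: query the label with probability equal to the uncertainty.
        if rng.random() < uncertainty(w, x):
            y = oracle(x)
            n_queries += 1
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            # The gradient of the logistic loss with respect to w is (p - y) * x.
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]
    return w, n_queries
```

Because updates happen only on queried samples, and samples are queried more often where the model is unsure, the updates are weighted by the uncertainty measure. This is the kind of interaction between uncertainty measure and original loss that the equivalent-loss viewpoint in the abstract is concerned with; the sketch does not compute that equivalent loss itself.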