Understanding Uncertainty Sampling

Uncertainty sampling is a prevalent active learning algorithm that sequentially queries annotations for the data samples about which the current prediction model is uncertain. However, the use of uncertainty sampling has been largely heuristic: (i) there is no consensus on the proper definition of "uncertainty" for a specific task under a specific loss; (ii) there is no theoretical guarantee that prescribes a standard protocol for implementing the algorithm, for example, how to handle sequentially arriving annotated data within the framework of optimization algorithms such as stochastic gradient descent. In this work, we systematically examine uncertainty sampling algorithms under both stream-based and pool-based active learning. We propose a notion of equivalent loss, which depends on the uncertainty measure used and on the original loss function, and establish that an uncertainty sampling algorithm essentially optimizes this equivalent loss. This perspective verifies the properness of existing uncertainty measures from two aspects: surrogate property and loss convexity. Furthermore, we propose a new notion for designing uncertainty measures called "loss as uncertainty". The idea is to use the conditional expected loss given the features as the uncertainty measure. Such an uncertainty measure has nice analytical properties and is general enough to cover both classification and regression problems, which enables us to provide the first generalization bound for uncertainty sampling algorithms under both stream-based and pool-based settings, in the full generality of the underlying model and problem. Lastly, we establish connections between certain variants of the uncertainty sampling algorithms and risk-sensitive objectives and distributional robustness, which can partly explain the advantage of uncertainty sampling algorithms when the sample size is small.
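As an illustration only, the stream-based setting with a "loss as uncertainty" style measure might be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure analyzed in this work: the logistic model, the plug-in estimate `min(p, 1 - p)` of the conditional expected 0-1 loss, the rescaling of that value into a query probability, and all function names are hypothetical choices made for the example. Because a label is queried with probability proportional to the uncertainty, the expected update is the original loss gradient weighted by that uncertainty, which is one informal way to read the claim that the algorithm optimizes an equivalent loss.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def expected_zero_one_loss(p):
    """Plug-in estimate of the conditional expected 0-1 loss given x.

    Assumption: the label is treated as Bernoulli(p), with p taken from the
    current model, so predicting the likelier class errs with prob. min(p, 1-p).
    """
    return min(p, 1.0 - p)


def stream_uncertainty_sampling(stream, dim, lr=0.5, seed=0):
    """One pass over a stream of (x, y) pairs, with y in {0, 1}.

    The label is used only when the sample is queried; queried samples
    trigger a single SGD step on the logistic loss.
    """
    rng = random.Random(seed)
    theta = [0.0] * dim
    n_queries = 0
    for x, y in stream:
        p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        u = expected_zero_one_loss(p)      # lies in [0, 0.5]
        if rng.random() < 2.0 * u:         # rescaled to a probability in [0, 1]
            # Gradient of the logistic loss with respect to theta is (p - y) * x.
            theta = [t - lr * (p - y) * xi for t, xi in zip(theta, x)]
            n_queries += 1
    return theta, n_queries
```

Calling `stream_uncertainty_sampling(data, dim=2)` on a list of `(features, label)` pairs returns the fitted weights and the number of labels actually requested; samples the model already classifies confidently are queried rarely, so the label budget is typically smaller than the stream length.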
