Improving Uncertainty Sampling with Bell Curve Weight Function

A supervised learning model is typically trained by passive learning, in which unlabelled instances are selected at random for annotation. This approach is effective, but it becomes costly when labelled instances are expensive to acquire. For example, manually identifying spam (labelled instances) among the thousands of emails (unlabelled instances) flooding an inbox during initial data collection is time-consuming. Uncertainty sampling, an active learning method, is commonly used to address this problem: it improves the efficiency of supervised learning by requiring fewer labelled instances than passive learning. Given an unlabelled data pool, uncertainty sampling queries the labels of instances whose predicted probabilities, $p$, fall in the uncertainty region, i.e., $p \approx 0.5$. The newly acquired labels are then added to the existing labelled data pool and a new model is learned. Nonetheless, the performance of uncertainty sampling is sensitive to the area of unpredictable responses (AUR) and to the nature of the dataset, so without prior knowledge of a new dataset it is difficult to decide between passive learning and uncertainty sampling. To address this issue, we propose bell curve sampling, which uses a bell curve weight function to acquire new labels. With the bell curve centred at $p = 0.5$, bell curve sampling selects instances from the uncertainty region most of the time without neglecting the rest. Simulation results show that, most of the time, bell curve sampling outperforms both uncertainty sampling and passive learning on datasets of different natures and with AUR.
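
The contrast between the two query strategies can be illustrated with a minimal sketch. The abstract does not give the exact form of the bell curve weight function, so the Gaussian shape, the width `sigma`, and the function names `bell_curve_weight`, `uncertainty_sample`, and `bell_curve_sample` are all assumptions made here for illustration, not the authors' definition.

```python
import math
import random


def bell_curve_weight(p, centre=0.5, sigma=0.15):
    """Gaussian-shaped weight, largest at p == centre.

    Assumption: the Gaussian form and sigma=0.15 are illustrative choices only.
    """
    return math.exp(-0.5 * ((p - centre) / sigma) ** 2)


def uncertainty_sample(probs, k):
    """Baseline: indices of the k instances whose predictions are closest to 0.5."""
    return sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))[:k]


def bell_curve_sample(probs, k, sigma=0.15, seed=None):
    """Draw k distinct indices, one at a time, with chance proportional to the weight.

    Instances near p = 0.5 are picked most often, but every instance keeps a
    non-zero chance of being queried.
    """
    rng = random.Random(seed)
    remaining = list(range(len(probs)))
    weights = [bell_curve_weight(probs[i], sigma=sigma) for i in remaining]
    chosen = []
    for _ in range(min(k, len(remaining))):
        # Weighted draw over the instances that have not been chosen yet.
        j = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(j))
        weights.pop(j)
    return chosen


if __name__ == "__main__":
    predicted = [0.02, 0.48, 0.51, 0.95, 0.30, 0.70]
    print("uncertainty sampling:", uncertainty_sample(predicted, 2))
    print("bell curve sampling: ", bell_curve_sample(predicted, 2, seed=0))
```

In this sketch, uncertainty sampling always returns the same instances nearest $p = 0.5$, whereas the weighted draw occasionally queries instances far from the uncertainty region. A smaller `sigma` makes the behaviour closer to uncertainty sampling, and a larger one makes it closer to the random selection of passive learning.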
