It is no exaggeration to say that recent progress in artificial intelligence depends on large-scale, high-quality data. At the same time, a pervasive issue remains: the budget for data labeling is constrained. Active learning is a prominent approach to this issue: the model selects the most valuable data for labeling, and the newly labeled data is then used to update the model iteratively. However, because only a limited amount of data is available in each iteration, the model is vulnerable to bias and is therefore more likely to yield overconfident predictions. In this paper, we present two novel methods to address the overconfidence that arises in the active learning scenario. The first is an augmentation strategy named Cross-Mix-and-Mix (CMaM), which aims to calibrate the model by expanding the limited training distribution. The second is a selection strategy named Ranked Margin Sampling (RankedMS), which prevents choosing data that leads to overly confident predictions. Through various experiments and analyses, we demonstrate that our proposals facilitate efficient data selection by alleviating overconfidence while remaining readily applicable.
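For context, the selection step of an active learning round can be illustrated with classical margin sampling, the standard baseline that the name RankedMS suggests it builds on. The sketch below is not the RankedMS method itself, whose ranking rule is not described in this abstract; the function name `margin_sampling`, the `budget` parameter, and the toy probabilities are assumptions made purely for illustration.

```python
def margin_sampling(probs, budget):
    """Classical margin sampling (baseline sketch, not the paper's RankedMS).

    probs  : list of per-sample class-probability lists from the current model
    budget : number of unlabeled samples to send for labeling in this round
    Returns the indices of the `budget` samples with the smallest margin
    between the two most probable classes.
    """
    scored = []
    for index, sample_probs in enumerate(probs):
        # Take the two largest class probabilities for this sample.
        top_two = sorted(sample_probs, reverse=True)[:2]
        margin = top_two[0] - top_two[1]
        scored.append((margin, index))
    # A small margin means the model hesitates between two classes,
    # so those samples are treated as the most informative ones.
    scored.sort()
    return [index for _, index in scored[:budget]]


# Hypothetical softmax outputs for four unlabeled samples (three classes).
probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.50, 0.30, 0.20],
    [0.34, 0.33, 0.33],
]
selected = margin_sampling(probs, budget=2)
```

Because this criterion relies directly on predicted probabilities, an overconfident model inflates the margins and distorts which samples are selected, which is the failure mode the abstract is concerned with.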