Modern machine learning has achieved remarkable success on many problems, but this success often depends on the availability of large labeled datasets. Active learning can dramatically reduce labeling cost when annotations are expensive, yet its early performance is frequently dominated by the initial seed set, which is typically chosen at random. In many applications, however, related or approximate datasets are readily available and can be leveraged to construct a better seed set. We introduce Active-Transfer Bagging (ATBagging), a new method for selecting the seed dataset for active learning. ATBagging estimates the informativeness of each candidate data point using a Bayesian interpretation of bagged ensemble models: it compares the in-bag and out-of-bag predictive distributions obtained from the labeled dataset, which yields an information-gain proxy. To avoid redundant selections, we impose feature-space diversity by sampling from a determinantal point process (DPP) whose kernel is built from Random Fourier Features and a quality-diversity factorization that incorporates the informativeness scores. The same combined procedure is also used to select new data points to collect during the active learning phase. We evaluate ATBagging on four real-world datasets covering both target-transfer and feature-shift scenarios (QM9, ERA5, Forbes 2000, and Beijing PM2.5). Across seed sizes n_seed = 10 to 100, ATBagging matches or improves early active learning performance and increases the area under the learning curve relative to alternative seed subset selection methods in almost all cases, with the strongest benefits in low-data regimes. ATBagging thus provides a low-cost, high-reward means of initiating active-learning-based data collection.
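The scoring and selection steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not specify the base learner, the exact form of the information-gain proxy, or how scores enter the kernel, so the sketch assumes: a ridge-regularized linear base learner, Gaussian summaries of the in-bag and out-of-bag member predictions compared by KL divergence, a quality term q = 1 + normalized score, and an RBF-type Random Fourier Feature map. It also swaps DPP sampling for a simple greedy MAP selection to stay short and deterministic. All function names are hypothetical.

```python
import numpy as np


def bagging_informativeness(X, y, n_models=200, ridge=1e-3, seed=0):
    """Per-point KL(in-bag || out-of-bag) between Gaussian summaries of member predictions.

    Assumption: the KL divergence stands in for the information-gain proxy.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])  # append a bias column
    preds = np.empty((n_models, n))
    in_bag = np.zeros((n_models, n), dtype=bool)
    for m in range(n_models):
        idx = rng.integers(0, n, size=n)  # bootstrap resample with replacement
        in_bag[m, idx] = True
        A = Xb[idx].T @ Xb[idx] + ridge * np.eye(d + 1)
        w = np.linalg.solve(A, Xb[idx].T @ y[idx])  # ridge least squares (assumed base learner)
        preds[m] = Xb @ w
    scores = np.zeros(n)
    for i in range(n):
        p_in = preds[in_bag[:, i], i]    # members that trained on point i
        p_out = preds[~in_bag[:, i], i]  # members that never saw point i
        if len(p_in) < 2 or len(p_out) < 2:
            continue  # not enough members on one side; leave the score at 0
        mu1, v1 = p_in.mean(), p_in.var() + 1e-9
        mu0, v0 = p_out.mean(), p_out.var() + 1e-9
        scores[i] = 0.5 * (np.log(v0 / v1) + (v1 + (mu1 - mu0) ** 2) / v0 - 1.0)
    return scores


def random_fourier_features(X, n_features=128, lengthscale=1.0, seed=0):
    """Random Fourier Features approximating an RBF kernel with the given lengthscale."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / lengthscale, size=(X.shape[1], n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)


def greedy_dpp_map(L, k):
    """Greedy MAP selection: repeatedly add the item that maximizes log det of L[S, S].

    Note: this is a deterministic substitute for sampling from the DPP.
    """
    selected = []
    for _ in range(k):
        best, best_val = None, -np.inf
        for j in range(L.shape[0]):
            if j in selected:
                continue
            S = selected + [j]
            sign, logdet = np.linalg.slogdet(L[np.ix_(S, S)])
            if sign > 0 and logdet > best_val:
                best, best_val = j, logdet
        if best is None:
            break
        selected.append(best)
    return selected


def select_seed(X, y, k, seed=0):
    """Build L = diag(q) Phi Phi^T diag(q) and pick k informative, diverse points."""
    scores = bagging_informativeness(X, y, seed=seed)
    q = 1.0 + scores / (scores.max() + 1e-12)  # assumed score-to-quality mapping
    Phi = random_fourier_features(X, seed=seed)
    L = (q[:, None] * (Phi @ Phi.T)) * q[None, :] + 1e-8 * np.eye(len(X))  # jitter for stability
    return greedy_dpp_map(L, k)
```

Calling `select_seed(X, y, k)` on a labeled related dataset returns the indices of k points that score highly on the proxy while remaining spread out in feature space. The greedy loop recomputes a log-determinant for every candidate, which is adequate for a few hundred points; larger pools would need an incremental update or a proper DPP sampler.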