Batch active learning is a popular approach for efficiently training machine learning models on large, initially unlabeled datasets by repeatedly acquiring labels for batches of data points. However, many recent batch active learning methods are white-box approaches that are often limited to differentiable parametric models: they score unlabeled points using acquisition functions based on model embeddings or first- and second-order derivatives. In this paper, we propose black-box batch active learning for regression tasks as an extension of white-box approaches. Crucially, our method relies only on model predictions. It is therefore compatible with a wide range of machine learning models, including regular and Bayesian deep learning models and non-differentiable models such as random forests. The method is rooted in Bayesian principles and builds on recent kernel-based approaches, which allows us to extend a wide range of existing state-of-the-art white-box batch active learning methods (BADGE, BAIT, LCMD) to black-box models. We demonstrate the effectiveness of our approach through extensive experimental evaluations on regression datasets, achieving surprisingly strong performance compared to white-box approaches for deep learning models.
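To make the idea of a prediction-only acquisition concrete, here is a minimal sketch of one way it could look. It is an illustration under stated assumptions, not the method as implemented in the paper: we assume a set of model samples (for example ensemble members) is available, take the empirical covariance of their predictions on the pool as a kernel, and greedily pick the point with the largest remaining variance, conditioning on each pick. The function names `prediction_kernel` and `greedy_select` and the `noise` parameter are hypothetical.

```python
import numpy as np


def prediction_kernel(preds):
    """Empirical covariance of predictions across model samples.

    preds: array of shape (n_members, n_points), one row per ensemble
    member (an assumed stand-in for samples from a Bayesian model).
    """
    centered = preds - preds.mean(axis=0, keepdims=True)
    return centered.T @ centered / preds.shape[0]


def greedy_select(kernel, batch_size, noise=1e-6):
    """Greedily select pool indices with the largest remaining variance.

    After each pick, the kernel is conditioned on the chosen point
    (a rank-one update), so near-duplicate points lose their appeal.
    `noise` is an assumed observation-noise term that keeps the
    division well defined.
    """
    k = np.array(kernel, dtype=float)
    selected = []
    for _ in range(batch_size):
        var = np.diag(k).copy()
        var[selected] = -np.inf  # never pick the same point twice
        i = int(np.argmax(var))
        selected.append(i)
        col = k[:, i].copy()
        k -= np.outer(col, col) / (k[i, i] + noise)
    return selected


# Toy pool: points 0 and 1 receive identical predictions from every
# member, point 2 varies independently with smaller variance.
preds = np.array(
    [[2.0, 2.0, 1.0],
     [-2.0, -2.0, 1.0],
     [2.0, 2.0, -1.0],
     [-2.0, -2.0, -1.0]]
)
batch = greedy_select(prediction_kernel(preds), batch_size=2)
```

Because only predictions enter the kernel, the same sketch would apply to a non-differentiable model; with a scikit-learn random forest, for instance, the per-tree predictions from `estimators_` could play the role of the rows of `preds`. In the toy pool, the second pick skips the duplicate of the first point, which is the diversity behaviour batch acquisition is meant to provide.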