Batch active learning is a popular approach for efficiently training machine learning models on large, initially unlabeled datasets by repeatedly acquiring labels for batches of data points. However, many recent batch active learning methods are white-box approaches limited to differentiable parametric models: they score unlabeled points using acquisition functions based on model embeddings or first- and second-order derivatives. In this paper, we propose black-box batch active learning for regression tasks as an extension of white-box approaches. This approach is compatible with a wide range of machine learning models, including regular and Bayesian deep learning models and non-differentiable models such as random forests. It is rooted in Bayesian principles and utilizes recent kernel-based approaches. Importantly, our method relies only on model predictions, which allows us to extend a wide range of existing state-of-the-art white-box batch active learning methods (BADGE, BAIT, LCMD) to black-box models. We demonstrate the effectiveness of our approach through extensive experimental evaluations on regression datasets, achieving surprisingly strong performance compared to white-box approaches for deep learning models.
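To make the idea of a prediction-only, kernel-based acquisition concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes that predictions from several ensemble members (for example, the trees of a random forest) are available for every pool point, builds a kernel from the empirical covariance of those predictions, and then selects a batch by greedy maximum-variance selection with Gaussian conditioning. The function names `prediction_kernel` and `greedy_batch`, the `noise` parameter, and the greedy rule itself are illustrative assumptions; the rule is a simple stand-in and is not BADGE, BAIT, or LCMD.

```python
def prediction_kernel(preds):
    """Empirical covariance of predictions across ensemble members.

    preds[k][i] is the prediction of (hypothetical) ensemble member k
    on pool point i. Only predictions are used: no gradients, no embeddings.
    """
    k = len(preds)
    n = len(preds[0])
    means = [sum(p[i] for p in preds) / k for i in range(n)]
    centered = [[p[i] - means[i] for i in range(n)] for p in preds]
    # Biased estimator (divides by k); an assumption made for simplicity.
    return [
        [sum(c[i] * c[j] for c in centered) / k for j in range(n)]
        for i in range(n)
    ]


def greedy_batch(kernel, batch_size, noise=1e-6):
    """Greedily pick the point with the largest remaining variance,
    then condition the kernel on that point (Gaussian conditioning).

    Requires batch_size <= number of pool points.
    """
    n = len(kernel)
    K = [row[:] for row in kernel]  # work on a copy
    chosen = []
    for _ in range(batch_size):
        candidates = [i for i in range(n) if i not in chosen]
        best = max(candidates, key=lambda i: K[i][i])
        chosen.append(best)
        col = [K[i][best] for i in range(n)]
        denom = K[best][best] + noise
        # Rank-one update: remove the variance explained by the chosen point.
        for i in range(n):
            for j in range(n):
                K[i][j] -= col[i] * col[j] / denom
    return chosen


# Toy pool: 3 ensemble members, 4 pool points.
# Points 0 and 1 receive identical predictions (redundant),
# point 2 has no disagreement, point 3 has moderate disagreement.
toy_preds = [
    [1.0, 1.0, 0.5, 0.0],
    [0.0, 0.0, 0.5, 1.0],
    [2.0, 2.0, 0.5, 0.5],
]
batch = greedy_batch(prediction_kernel(toy_preds), batch_size=2)
print(batch)
```

In this toy pool, conditioning on the first selected point removes almost all remaining variance from its redundant twin, so the second pick falls on a different, still-uncertain point. This is the kind of batch diversity that kernel-based acquisition functions aim for, here obtained without access to model internals.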