Acquiring labels for supervised learning can be expensive. To improve the sample efficiency of neural network regression, we study active learning methods that adaptively select batches of unlabeled data for labeling. We present a framework for constructing such methods out of (network-dependent) base kernels, kernel transformations, and selection methods. Our framework encompasses many existing Bayesian methods based on Gaussian process approximations of neural networks, as well as non-Bayesian methods. Additionally, we propose to replace the commonly used last-layer features with sketched finite-width Neural Tangent Kernels, and to combine them with a novel clustering method. To evaluate different methods, we introduce an open-source benchmark consisting of 15 large tabular regression data sets. Our proposed method outperforms the state of the art on our benchmark, scales to large data sets, and works out of the box without adjusting the network architecture or training code. We provide open-source code that includes efficient implementations of all kernels, kernel transformations, and selection methods, and that can be used to reproduce our results.
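The three ingredients named in the abstract (a base kernel, a kernel transformation, and a selection method) can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the features are random placeholders standing in for network-dependent features, the base kernel is a plain linear kernel on those features, the transformation is Gaussian process posterior conditioning on the labeled set, and the selection rule is greedy maximum posterior variance. The function names, `noise_var`, and the unit-Gaussian prior are hypothetical choices; the sketched Neural Tangent Kernel and the clustering method proposed in the paper are not reproduced here.

```python
import numpy as np


def posterior_weight_cov(train_feats, noise_var=1e-2):
    """Kernel transformation (assumed): condition a linear-kernel GP on labeled data.

    With base kernel k(x, x') = phi(x) @ phi(x') and prior w ~ N(0, I), the
    posterior kernel is phi(x) @ cov @ phi(x'), where cov is returned here.
    """
    dim = train_feats.shape[1]
    precision = np.eye(dim) + train_feats.T @ train_feats / noise_var
    return np.linalg.inv(precision)


def greedy_max_variance(pool_feats, weight_cov, batch_size, noise_var=1e-2):
    """Selection method (assumed): repeatedly pick the pool point with the
    largest posterior variance, then condition on it so that later picks
    avoid redundant points."""
    cov = weight_cov.copy()
    available = np.ones(len(pool_feats), dtype=bool)
    selected = []
    for _ in range(batch_size):
        # Posterior variance phi(x) @ cov @ phi(x) for every pool point.
        var = np.einsum("ij,jk,ik->i", pool_feats, cov, pool_feats)
        idx = int(np.argmax(np.where(available, var, -np.inf)))
        selected.append(idx)
        available[idx] = False
        # Rank-one (Sherman-Morrison) update: condition on the chosen point.
        v = cov @ pool_feats[idx]
        cov = cov - np.outer(v, v) / (noise_var + pool_feats[idx] @ v)
    return selected


# Placeholder features; in practice these would come from the trained network.
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(20, 8))   # features of already labeled points
pool_feats = rng.normal(size=(200, 8))   # features of unlabeled candidates
batch = greedy_max_variance(
    pool_feats, posterior_weight_cov(train_feats), batch_size=5
)
```

Here `batch` holds the indices of the five pool points that would be sent for labeling. Swapping the feature map, the transformation, or the selection rule independently is the kind of modularity the framework describes.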