The acquisition of labels for supervised learning can be expensive. To improve the sample efficiency of neural network regression, we study active learning methods that adaptively select batches of unlabeled data for labeling. We present a framework for constructing such methods out of (network-dependent) base kernels, kernel transformations, and selection methods. Our framework encompasses many existing Bayesian methods based on Gaussian process approximations of neural networks as well as non-Bayesian methods. Additionally, we propose to replace the commonly used last-layer features with sketched finite-width neural tangent kernels and to combine them with a novel clustering method. To evaluate different methods, we introduce an open-source benchmark consisting of 15 large tabular regression data sets. Our proposed method outperforms the state of the art on our benchmark, scales to large data sets, and works out-of-the-box without adjusting the network architecture or training code. We provide open-source code that includes efficient implementations of all kernels, kernel transformations, and selection methods, and can be used for reproducing our results.
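To make the three-part structure (base kernel, kernel transformation, selection method) concrete, the following is a minimal NumPy sketch. It is not the authors' implementation and makes several assumptions: the base kernel is a plain linear kernel on precomputed per-sample feature vectors (for example last-layer activations or flattened gradients, whose extraction is left out), the kernel transformation is a Gaussian random projection standing in for sketching, and the selection method is greedy farthest-point selection standing in for the clustering method, which the abstract does not specify. All function names are hypothetical.

```python
import numpy as np


def make_gaussian_sketch(in_dim: int, out_dim: int, rng: np.random.Generator) -> np.ndarray:
    """Kernel transformation (assumed): Gaussian random projection matrix.

    With entries drawn from N(0, 1 / out_dim), inner products of projected
    features equal the original linear-kernel values in expectation.
    """
    return rng.standard_normal((in_dim, out_dim)) / np.sqrt(out_dim)


def select_batch_farthest_point(
    pool_feats: np.ndarray, train_feats: np.ndarray, batch_size: int
) -> list[int]:
    """Selection method (assumed stand-in): greedy farthest-point selection.

    Repeatedly picks the pool point whose squared distance in feature space
    to the nearest labeled or already selected point is largest.
    """
    if batch_size > len(pool_feats):
        raise ValueError("batch_size must not exceed the pool size")

    # Squared distance from each pool point to its nearest training point.
    # This broadcast costs O(pool * train * dim) memory; fine for a sketch.
    if len(train_feats) > 0:
        diffs = pool_feats[:, None, :] - train_feats[None, :, :]
        min_sq_dist = (diffs ** 2).sum(axis=-1).min(axis=1)
    else:
        min_sq_dist = np.full(len(pool_feats), np.inf)

    selected: list[int] = []
    for _ in range(batch_size):
        idx = int(np.argmax(min_sq_dist))
        selected.append(idx)
        # Update nearest-point distances with the newly selected point.
        new_sq_dist = ((pool_feats - pool_feats[idx]) ** 2).sum(axis=1)
        min_sq_dist = np.minimum(min_sq_dist, new_sq_dist)
        min_sq_dist[idx] = -np.inf  # never pick the same index twice
    return selected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder features; in practice these would come from the trained network.
    train_feats = rng.standard_normal((20, 256))
    pool_feats = rng.standard_normal((1000, 256))

    # Use the same projection for both sets so kernel values stay comparable.
    sketch = make_gaussian_sketch(in_dim=256, out_dim=32, rng=rng)
    batch = select_batch_farthest_point(pool_feats @ sketch, train_feats @ sketch, batch_size=16)
    print(batch)
```

In a full active learning loop, the selected indices would be labeled, moved from the pool to the training set, and the network retrained before features are recomputed for the next batch. Swapping any one of the three components leaves the other two unchanged, which is the modularity the framework description points to.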