Many machine learning regression methods rely on large datasets to train predictive models. However, using large datasets may be infeasible because of computational limitations or high labelling costs. Sampling small training sets from large pools of unlabelled data points is therefore essential to maximize model performance while maintaining computational efficiency. In this work, we study a sampling approach that aims to minimize the fill distance of the selected set. We derive an upper bound on the maximum expected prediction error that depends linearly on the training set fill distance, conditional on knowledge of the data features. For empirical validation, we perform experiments using two regression models on two datasets. We show empirically that selecting a training set by aiming to minimize the fill distance, and thereby the bound, significantly reduces the maximum prediction error of various regression models, outperforming existing sampling approaches by a large margin.
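To make the notion of fill distance concrete, the following is a minimal sketch and not necessarily the procedure used in this work. It assumes a finite pool of feature vectors and the Euclidean metric, and it uses greedy farthest-point sampling, a standard heuristic for the closely related k-center problem, as a stand-in for "aiming to minimize the fill distance". The function names `fill_distance` and `farthest_point_sampling` are hypothetical and chosen for illustration only.

```python
import numpy as np


def fill_distance(pool, selected_idx):
    """Largest distance from any pool point to its nearest selected point."""
    # Pairwise Euclidean distances, shape (n_pool, n_selected).
    diffs = pool[:, None, :] - pool[selected_idx][None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    return float(dists.min(axis=1).max())


def farthest_point_sampling(pool, k, first=0):
    """Greedily pick k indices; each new point is the one farthest
    from the points selected so far (assumed heuristic, see lead-in)."""
    selected = [first]
    # Distance from every pool point to its nearest selected point.
    nearest = np.linalg.norm(pool - pool[first], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(nearest))
        selected.append(nxt)
        nearest = np.minimum(nearest, np.linalg.norm(pool - pool[nxt], axis=1))
    return selected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pool = rng.normal(size=(500, 2))  # synthetic unlabelled features
    fps_idx = farthest_point_sampling(pool, k=20)
    rand_idx = rng.choice(len(pool), size=20, replace=False).tolist()
    print("fill distance, farthest-point:", fill_distance(pool, fps_idx))
    print("fill distance, random:        ", fill_distance(pool, rand_idx))
```

Only the unlabelled features enter the selection, which matches the setting described above: labels are needed only for the points that end up in the training set. The pairwise-distance computation in `fill_distance` uses memory proportional to the pool size times the number of selected points, so a very large pool would need a chunked variant.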