For regression tasks, predictive machine learning models are often trained on large datasets. However, using large datasets may not be feasible because of computational limitations or high data labelling costs. Selecting small training sets from large pools of unlabelled data points is therefore essential to maximize model performance while maintaining efficiency. In this work, we study Farthest Point Sampling (FPS), a data selection approach that aims to minimize the fill distance of the selected set. We derive an upper bound for the maximum expected prediction error, conditional on the location of the unlabelled data points, that depends linearly on the training set fill distance. For empirical validation, we perform experiments using two regression models on three datasets. We show empirically that selecting a training set by aiming to minimize the fill distance, thereby minimizing our derived bound, significantly reduces the maximum prediction error of various regression models, outperforming alternative sampling approaches by a large margin. Furthermore, we show that selecting training sets with FPS can also increase model stability for the specific case of Gaussian kernel regression approaches.
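
To make the two central notions concrete, the following is a minimal sketch of greedy farthest point sampling and of the fill distance of a selected subset. It is an illustration under stated assumptions, not the implementation used in this work: the function names `farthest_point_sampling` and `fill_distance` are hypothetical, the distance is assumed to be Euclidean, and the first point is assumed to be chosen by index (the choice of the initial point may differ in practice).

```python
import numpy as np


def farthest_point_sampling(X, k, first=0):
    """Greedy FPS: return k row indices of X, each new one farthest from those already chosen.

    Assumptions (illustrative only): Euclidean distance, 1 <= k <= len(X),
    and the initial point is given by the index `first`.
    """
    X = np.asarray(X, dtype=float)
    selected = [first]
    # min_dist[i] is the distance from pool point i to its nearest selected point.
    min_dist = np.linalg.norm(X - X[first], axis=1)
    for _ in range(k - 1):
        # Pick the pool point that is currently worst covered by the selection.
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        # Update nearest-selected distances with the newly added point.
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return selected


def fill_distance(X, selected):
    """Largest distance from any pool point to its nearest selected point."""
    X = np.asarray(X, dtype=float)
    # Pairwise distances between all pool points and the selected points.
    d = np.linalg.norm(X[:, None, :] - X[selected][None, :, :], axis=2)
    return float(d.min(axis=1).max())


if __name__ == "__main__":
    # Small hypothetical pool of unlabelled points in the plane.
    pool = np.array([[0, 0], [1, 0], [10, 0], [10, 1], [5, 5]])
    idx = farthest_point_sampling(pool, k=3)
    print(idx, fill_distance(pool, idx))
```

Each greedy step adds the point that currently attains the fill distance of the selection, so the fill distance cannot increase as points are added. The pairwise distance matrix in `fill_distance` is kept for readability; for large pools a nearest-neighbour structure would likely be preferable.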