The parameters of a machine learning model are typically learned by minimizing a loss function on a set of training data. However, this carries the risk of overfitting: for the model to generalize well, we must be able to find the parameter that is optimal for the entire population, not only for the given training sample. In this paper, we construct valid confidence sets for this population-optimal parameter of a machine learning model, which can be generated using only the training data and without any knowledge of the population. We then show that studying the distribution of these confidence sets allows us to assign a notion of confidence to arbitrary regions of the parameter space, and we demonstrate that this distribution can be well approximated using bootstrapping techniques.
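The bootstrap approximation mentioned above can be illustrated with a minimal sketch. This is not the paper's construction, only a standard percentile-bootstrap example under simple assumptions: squared loss, a one-dimensional parameter, and a synthetic training sample whose population-optimal parameter (the population mean, 2.0 here) is known to us but never seen by the procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training sample; the population-optimal parameter under squared
# loss is the population mean (2.0), which the procedure never observes.
x = rng.normal(loc=2.0, scale=1.0, size=200)

def fit(sample):
    # Minimizing the empirical squared loss sum_i (theta - x_i)^2
    # yields the sample mean.
    return sample.mean()

# Percentile bootstrap: refit the parameter on resamples drawn
# (with replacement) from the training data only.
B = 2000
boot = np.array([fit(rng.choice(x, size=x.size, replace=True))
                 for _ in range(B)])

# A 95% confidence set (here an interval) for the population-optimal
# parameter, built without any knowledge of the population.
lo, hi = np.quantile(boot, [0.025, 0.975])
print(f"95% bootstrap CI: [{lo:.3f}, {hi:.3f}]")
```

In this one-dimensional case the confidence set is an interval; the abstract's construction concerns general regions of the parameter space, for which the bootstrap distribution of the refitted parameter plays the analogous role.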