We conduct a non-asymptotic study of the Cross Validation (CV) estimate of the generalization risk for learning algorithms dedicated to extreme regions of the covariate space. In this Extreme Value Analysis context, the risk function measures the algorithm's error given that the norm of the input exceeds a high quantile. The main challenges in this framework are that the extreme training sample is negligible in size compared with the full sample, and that the risk function must be rescaled by a probability tending to zero. We take a first step toward a finite-sample understanding of CV for extreme values by establishing two new results: an exponential probability bound on the $K$-fold CV error and a polynomial probability bound on the leave-$p$-out CV error. Our bounds are sharp in the sense that they match state-of-the-art guarantees for standard CV estimates while extending them to encompass a conditioning event of small probability. We illustrate the significance of our results for high-dimensional classification in extreme regions via a Lasso-type logistic regression algorithm. The tightness of our bounds is investigated in numerical experiments.
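To make the object of study concrete, the following is a minimal sketch of a $K$-fold CV estimate of a risk restricted to extreme inputs. It rests on several assumptions that the abstract does not specify: the threshold is taken as the empirical $(1-\alpha)$-quantile of the input norms over the full sample, the loss is the 0-1 loss, and each fold's error count is rescaled by $\alpha$ times the fold size. The names `extreme_kfold_risk`, `fit_centroids`, and `predict_centroids` are hypothetical, and the nearest-centroid rule on the angular component is only a simple stand-in for the Lasso-type logistic regression mentioned above.

```python
import numpy as np


def fit_centroids(X, y):
    """Stand-in classifier: class centroids of the angular component X / ||X||."""
    angles = X / np.linalg.norm(X, axis=1, keepdims=True)
    labels = np.unique(y)
    centroids = np.stack([angles[y == c].mean(axis=0) for c in labels])
    return labels, centroids


def predict_centroids(model, X):
    """Assign each point to the class with the nearest angular centroid."""
    labels, centroids = model
    angles = X / np.linalg.norm(X, axis=1, keepdims=True)
    dists = np.linalg.norm(angles[:, None, :] - centroids[None, :, :], axis=2)
    return labels[dists.argmin(axis=1)]


def extreme_kfold_risk(X, y, fit, predict, alpha=0.1, n_folds=5, seed=0):
    """K-fold CV estimate of the 0-1 risk on the region ||X|| > threshold."""
    rng = np.random.default_rng(seed)
    n = len(y)
    norms = np.linalg.norm(X, axis=1)
    # Assumption: the threshold is the empirical (1 - alpha)-quantile of all norms.
    threshold = np.quantile(norms, 1.0 - alpha)
    extreme = norms > threshold

    fold_risks = []
    for val_idx in np.array_split(rng.permutation(n), n_folds):
        train_mask = np.ones(n, dtype=bool)
        train_mask[val_idx] = False
        train_ext = train_mask & extreme    # extreme points used for training
        val_ext = ~train_mask & extreme     # extreme points used for validation
        if not train_ext.any() or not val_ext.any():
            continue  # skip folds with no extreme points on either side
        model = fit(X[train_ext], y[train_ext])
        errors = predict(model, X[val_ext]) != y[val_ext]
        # Assumption: rescale the error count by alpha * (validation fold size),
        # the empirical counterpart of dividing by the small exceedance probability.
        fold_risks.append(errors.sum() / (alpha * len(val_idx)))
    return float(np.mean(fold_risks))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.pareto(2.0, size=(2000, 5)) + 1.0   # heavy-tailed synthetic covariates
    y = (X[:, 0] > X[:, 1]).astype(int)         # synthetic labels
    print(extreme_kfold_risk(X, y, fit_centroids, predict_centroids))
```

With `alpha=0.1` and 2000 points, only about 200 observations enter training and validation at all, which illustrates the small effective sample size the abstract identifies as the main difficulty.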