We conduct a non-asymptotic study of the Cross-Validation (CV) estimate of the generalization risk for learning algorithms dedicated to extreme regions of the covariate space. In this Extreme Value Analysis context, the risk function measures the algorithm's error given that the norm of the input exceeds a high quantile. The main challenge within this framework is the negligible size of the extreme training sample relative to the full sample size, together with the need to rescale the risk function by a probability tending to zero. We open the road to a finite-sample understanding of CV for extreme values by establishing two new results: an exponential probability bound on the \Kfold CV error and a polynomial probability bound on the leave-\textrm{p}-out CV error. Our bounds are sharp in the sense that they match state-of-the-art guarantees for standard CV estimates while extending them to encompass a conditioning event of small probability. We illustrate the significance of our results for high-dimensional classification in extreme regions via a Lasso-type logistic regression algorithm. The tightness of our bounds is investigated in numerical experiments.
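To fix ideas, the following is a minimal sketch of the quantities involved, under assumed notation not given above (threshold $t_\tau$, loss $\ell$, folds $F_1,\dots,F_K$, predictor $\hat f_{-k}$); the paper's exact estimator may differ:
\[
R_\tau(f) \;=\; \mathbb{E}\bigl[\ell(f(X),Y)\,\big|\,\|X\| > t_\tau\bigr]
\;=\; \frac{\mathbb{E}\bigl[\ell(f(X),Y)\,\mathbf{1}\{\|X\| > t_\tau\}\bigr]}{\mathbb{P}(\|X\| > t_\tau)},
\qquad
\widehat{R}^{\,\mathrm{CV}}_\tau \;=\; \frac{1}{K}\sum_{k=1}^{K}
\frac{\sum_{i\in F_k}\ell\bigl(\hat f_{-k}(X_i),Y_i\bigr)\,\mathbf{1}\{\|X_i\| > t_\tau\}}{\sum_{i\in F_k}\mathbf{1}\{\|X_i\| > t_\tau\}},
\]
where $t_\tau$ is a high quantile of $\|X\|$, the folds $F_1,\dots,F_K$ partition the sample, and $\hat f_{-k}$ is trained on the extreme observations outside fold $k$. The denominators make explicit both the rescaling by a probability tending to zero and the small effective size of the extreme sub-sample.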