We develop a statistical learning theory for gradient boosting applied to the estimation of covariate-dependent Generalized Pareto (GP) distributions in Peaks-over-Threshold modeling. After an orthogonal reparametrization of the GP likelihood that diagonalizes its Fisher information matrix, we cast the estimation problem within the Empirical Risk Minimization (ERM) framework and derive non-asymptotic error bounds for the boosting estimator. Our analysis accounts for three distinct sources of error: statistical fluctuations; the approximation bias inherent to the asymptotic nature of the GP model, which is controlled under second-order regular variation; and the approximation error due to the finite number of boosting iterations. This makes the resulting bias-variance trade-off explicit. We illustrate the practical benefits of the reparametrization through simulations, showing that it significantly reduces gradient correlation during training and improves convergence stability. We apply the methodology to a medical malpractice insurance dataset from the Texas Department of Insurance comprising over 18,000 closed claims. The gradient boosting approach yields a good fit for the tail of settlement cost distributions and reveals that the number of days to settlement is the dominant predictor of tail heaviness, consistent with earlier findings in the reserving literature.