The Tweedie exponential dispersion family is a popular choice for modeling insurance losses, which typically consist of zero-inflated, semicontinuous data. For such data, it is often important to obtain credibility (inference) for the most important features that describe the endogenous variables. Post-selection inference is the standard statistical procedure for obtaining confidence intervals for model parameters after a feature-selection step. In a linear model, the lasso estimate often has non-negligible bias for large coefficients corresponding to exogenous variables, so valid inference on those coefficients requires correcting that bias. Traditional statistical methods, such as hypothesis testing or standard confidence interval construction, can lead to incorrect conclusions after selection because they are generally too optimistic. Here we discuss several methodologies for constructing confidence intervals for the coefficients after feature selection in the Generalized Linear Model (GLM) family, with application to insurance data.
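The lasso bias on large coefficients can be seen in a minimal sketch, assuming an orthonormal design (X^T X = I) and the objective (1/2)||y - Xb||^2 + lam * ||b||_1. Under those assumptions the lasso solution is the soft-thresholded least-squares coefficient, so every selected coefficient is pulled toward zero by exactly `lam`. The coefficient values and the penalty below are hypothetical numbers chosen for illustration, not results from insurance data.

```python
def soft_threshold(z, lam):
    """Lasso solution for one coefficient under an orthonormal design."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical least-squares coefficients (assumed values, for illustration only).
ols = [5.0, 0.3, -4.0]
lam = 1.0  # assumed penalty level

# Lasso estimates: small coefficients are set to zero, large ones are shrunk.
lasso = [soft_threshold(z, lam) for z in ols]

# Features kept by the lasso (nonzero estimates).
selected = [j for j, b in enumerate(lasso) if b != 0.0]

# Shrinkage on each selected coefficient: equal to lam in magnitude.
shrinkage = [ols[j] - lasso[j] for j in selected]

# Refitting least squares on the selected features removes the shrinkage;
# under an orthonormal design the refit simply restores the OLS values.
refit = [ols[j] if j in selected else 0.0 for j in range(len(ols))]

print("lasso:    ", lasso)
print("selected: ", selected)
print("shrinkage:", shrinkage)
print("refit:    ", refit)
```

The refit removes the shrinkage but does not by itself give valid intervals: the selected set depends on the same data, which is why naive intervals computed after selection tend to be too optimistic.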