Structured additive distributional copula regression makes it possible to model the joint distribution of multivariate outcomes by relating all distribution parameters to covariates. Estimation via statistical boosting accommodates high-dimensional data and incorporates data-driven variable selection, both of which are useful given the complexity of the model class. However, as is known from univariate (distributional) regression, the standard boosting algorithm tends to select too many variables of minor importance, particularly in settings with large sample sizes, leading to complex models that are difficult to interpret. To counteract this behavior and to avoid selecting base-learners with only a negligible impact, we combined the ideas of probing, stability selection and a new deselection approach with statistical boosting for distributional copula regression. In a simulation study and an application to the joint modelling of weight and length of newborns, we found that all proposed methods enhance variable selection by reducing the number of false positives. However, only stability selection and the deselection approach yielded predictive performance similar to that of classical boosting. Finally, the deselection approach scales better to larger datasets and achieved competitive predictive performance, which we further illustrated in a genomic cohort study from the UK Biobank by modelling the joint genetic predisposition for two phenotypes.
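To make the idea of probing concrete, the following is a minimal sketch under strong simplifying assumptions: it uses plain component-wise least-squares boosting for a single Gaussian outcome rather than the distributional copula model of the paper, and it assumes no covariate column is constant. In its usual formulation, probing augments the covariates with permuted copies ("probes") that carry no information about the outcome and stops boosting as soon as a probe is selected. The function name `boost_with_probing` and its parameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def boost_with_probing(X, y, step=0.1, max_iter=500, seed=0):
    """Component-wise L2 boosting that stops when a probe is first selected.

    Illustrative simplification: univariate Gaussian outcome, linear
    base-learners, no copula. Returns the indices of selected covariates.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape

    # Probes: every column is permuted independently, which keeps its
    # marginal distribution but destroys any association with y.
    probes = np.column_stack([rng.permutation(X[:, j]) for j in range(p)])

    # Standardise originals and probes so that slopes are comparable.
    Z = np.hstack([X, probes])
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)

    coef = np.zeros(2 * p)
    resid = y - y.mean()  # start from the intercept-only model

    for _ in range(max_iter):
        # Least-squares slope of each standardised column on the residuals.
        # With unit-variance columns, the largest |slope| is also the
        # base-learner with the largest reduction in squared error.
        slopes = Z.T @ resid / n
        best = int(np.argmax(np.abs(slopes)))

        if best >= p:
            # The best-fitting base-learner is a probe: stop boosting.
            break

        coef[best] += step * slopes[best]
        resid = resid - step * slopes[best] * Z[:, best]

    # Only original covariates can have non-zero coefficients here.
    return np.flatnonzero(coef[:p])
```

Calling `boost_with_probing(X, y)` on a covariate matrix `X` of shape `(n, p)` and an outcome vector `y` returns the indices of the covariates selected before the first probe entered the model. In this sketch the stopping iteration replaces tuning by cross-validation, which is the main computational appeal of probing.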