Motivated by challenges in the analysis of biomedical data and observational studies, we develop statistical boosting for the general class of bivariate distributional copula regression with arbitrary marginal distributions, which is suited to model binary, count, continuous or mixed outcomes. In our framework, the joint distribution of arbitrary, bivariate responses is modelled through a parametric copula. To arrive at a model for the entire conditional distribution, not only the marginal distribution parameters but also the copula parameters are related to covariates through additive predictors. We suggest efficient and scalable estimation by means of an adapted component-wise gradient boosting algorithm with statistical models as base-learners. A key benefit of boosting as opposed to classical likelihood or Bayesian estimation is the implicit data-driven variable selection mechanism as well as shrinkage without additional input or assumptions from the analyst. To the best of our knowledge, our implementation is the only one that combines a wide range of covariate effects, marginal distributions, copula functions, and implicit data-driven variable selection. We showcase the versatility of our approach on data from genetic epidemiology, healthcare utilization and childhood undernutrition. Our developments are implemented in the R package gamboostLSS, fostering transparent and reproducible research.
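The implicit variable selection mentioned above comes from the component-wise update rule: in each iteration only the single best-fitting base-learner is updated, so covariates that are never chosen keep a zero coefficient. The following is a minimal sketch of that mechanism, under strong simplifying assumptions. It uses a plain squared-error loss with one linear base-learner per covariate, not the copula likelihood of the paper, and it is not the gamboostLSS API. The function name `componentwise_boost`, the simulated data, and the tuning values are hypothetical and chosen only for illustration.

```python
import numpy as np


def componentwise_boost(X, y, n_steps=50, step_length=0.1):
    """Component-wise L2 boosting with one linear base-learner per column.

    Illustrative sketch only: squared-error loss, centred covariates,
    no copula and no distributional parameters.
    """
    n, p = X.shape
    intercept = y.mean()
    coef = np.zeros(p)
    fit = np.full(n, intercept)
    col_ss = (X ** 2).sum(axis=0)  # sum of squares of each covariate

    for _ in range(n_steps):
        # Negative gradient of the squared-error loss is the residual.
        resid = y - fit
        # Fit every base-learner separately to the negative gradient.
        slopes = X.T @ resid / col_ss
        # Residual sum of squares achieved by each candidate base-learner.
        rss = ((resid[:, None] - X * slopes) ** 2).sum(axis=0)
        # Update only the best-fitting component, damped by the step length.
        best = int(np.argmin(rss))
        coef[best] += step_length * slopes[best]
        fit += step_length * slopes[best] * X[:, best]

    return intercept, coef


# Simulated data (hypothetical): only covariates 0 and 3 are informative.
rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)  # centre the covariates
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(scale=0.5, size=n)

intercept, coef = componentwise_boost(X, y)
selected = np.flatnonzero(coef)  # indices of covariates picked at least once
print("selected covariates:", selected)
print("coefficients:", np.round(coef, 2))
```

Stopping early (a small `n_steps`) is what yields both the shrinkage and the sparsity in this sketch. In practice the number of iterations would be tuned, for example by cross-validation, and the distributional copula setting would apply the same selection step to the predictors of the marginal and dependence parameters.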