We address a classical problem in statistics: adding two-way interaction terms to a regression model. Because the covariate dimension then grows quadratically in the number of main effects, we develop an estimator that adapts well to this increase while providing accurate estimates and appropriate inference. Existing strategies overcome the dimensionality problem by allowing interactions only between relevant main effects. Building on this philosophy, we implement a softer link between the two types of effects using a local shrinkage model. We show empirically that borrowing strength between the amount of shrinkage for main effects and that for their interactions can strongly improve estimation of the regression coefficients. Moreover, we evaluate the potential of the model for inference, which is notoriously hard for selection strategies. Large-scale cohort data are used to provide realistic illustrations and evaluations, including comparisons with other methods. The evaluation of variable importance is not trivial in regression models with many interaction terms. Therefore, we derive a new analytical formula for the Shapley value, which enables rapid assessment of individual-specific variable importance scores and their uncertainties. Finally, although prediction is not our primary target, we show that our models can be very competitive with a more advanced machine learner, such as random forest, even for fairly large sample sizes. The implementation of our method in RStan is fairly straightforward, allowing for adjustments to specific needs.
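To make the dimensionality issue concrete, the following minimal sketch (in Python with NumPy, not the RStan implementation mentioned above) builds a design matrix with all two-way interaction terms and prints how the number of columns grows. The helper name `add_two_way_interactions` and the simulated data are illustrative assumptions; the sketch shows only the quadratic growth, not the shrinkage model itself.

```python
from itertools import combinations

import numpy as np


def add_two_way_interactions(X):
    """Append all pairwise products x_j * x_k (j < k) to the columns of X.

    Hypothetical helper for illustration; not part of the described method.
    """
    n, p = X.shape
    pairs = list(combinations(range(p), 2))
    if not pairs:
        # A single covariate has no two-way interactions.
        return X.copy(), pairs
    interactions = np.column_stack([X[:, j] * X[:, k] for j, k in pairs])
    return np.hstack([X, interactions]), pairs


rng = np.random.default_rng(0)
for p in (5, 10, 50):
    X = rng.normal(size=(20, p))  # simulated covariates, assumption only
    X_full, pairs = add_two_way_interactions(X)
    # p main effects plus p * (p - 1) / 2 interaction columns
    print(p, X_full.shape[1])
```

With p main effects the expanded matrix has p + p(p - 1)/2 columns, so 50 covariates already yield 1275 coefficients to estimate.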