When multitudes of features can plausibly be associated with a response, both privacy considerations and model parsimony suggest grouping them to increase the predictive power of a regression model. Specifically, identifying groups of predictors significantly associated with the response variable eases downstream analysis and decision-making. This paper proposes a new data analysis methodology that uses the high-dimensional predictor space to construct an implicit network with weighted edges to identify significant associations between the response and the predictors. Using a population model for groups of predictors defined via network-wide metrics, a new supervised grouping algorithm is proposed that determines the correct group with probability tending to one as the sample size diverges to infinity. To this end, we establish several theoretical properties of the estimates of network-wide metrics. A novel model-assisted bootstrap procedure that substantially decreases computational complexity is developed, facilitating the assessment of uncertainty in the estimates of network-wide metrics. The proposed methods account for several challenges that arise in the high-dimensional data setting, including (i) a large number of predictors, (ii) uncertainty regarding the true statistical model, and (iii) model selection variability. The performance of the proposed methods is demonstrated through numerical experiments, data from sports analytics, and breast cancer data.