An important issue in many multivariate regression problems is eliminating candidate predictors with null predictor vectors. In the large-dimensional (LD) setting, where the numbers of responses and predictors are both large, model selection faces a scalability challenge. Knock-one-out (KOO) statistics hold promise for meeting this challenge. In this paper, the almost sure limits and the central limit theorem of the KOO statistics are derived under the LD setting and mild distributional assumptions on the errors (finite fourth moments). These theoretical results guarantee the strong consistency of a subset selection rule based on the KOO statistics with a general threshold. To enhance the robustness of the selection rule, we also propose a bootstrap threshold for the KOO approach. Simulation results support our conclusions and show that the KOO approach with the bootstrap threshold achieves better selection probabilities than methods using the Akaike information threshold, the Bayesian information threshold, and Mallows' $C_p$ threshold. We apply the proposed KOO approach and the information-threshold-based approaches to a chemometrics dataset and a yeast cell-cycle dataset; the comparison suggests that our proposed method identifies useful models.
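To make the knock-one-out idea concrete, here is a minimal sketch in Python. The abstract does not state the exact form of the KOO statistic, so the code assumes a log-determinant form that appears in the KOO literature: for each predictor $j$, compare the residual covariance of the full model with that of the model with predictor $j$ removed, and keep $j$ when the statistic exceeds a threshold. The function names (`residual_logdet`, `koo_statistics`, `koo_select`), the simulated data, and the AIC-style cutoff of $2p$ are all illustrative assumptions. The bootstrap threshold proposed in the paper is not implemented here.

```python
import numpy as np


def residual_logdet(X, Y):
    """Log-determinant of the residual covariance after regressing Y on X.

    Assumes n > k + p so that the residual covariance is nonsingular.
    No intercept is added; include a column of ones in X if one is wanted.
    """
    n = X.shape[0]
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ coef
    _, logdet = np.linalg.slogdet(resid.T @ resid / n)
    return logdet


def koo_statistics(X, Y):
    """Assumed KOO statistic: n * log(|Sigma_{-j}| / |Sigma_full|) for each j."""
    n, k = X.shape
    full = residual_logdet(X, Y)
    stats = np.empty(k)
    for j in range(k):
        X_minus_j = np.delete(X, j, axis=1)  # knock out predictor j only
        stats[j] = n * (residual_logdet(X_minus_j, Y) - full)
    return stats


def koo_select(X, Y, threshold):
    """Indices of predictors whose KOO statistic exceeds the threshold."""
    return np.flatnonzero(koo_statistics(X, Y) > threshold)


# Simulated example (illustrative only): predictors 0 and 1 are active,
# the remaining four have null coefficient rows.
rng = np.random.default_rng(0)
n, k, p = 200, 6, 3
X = rng.standard_normal((n, k))
B = np.zeros((k, p))
B[:2] = 2.0
Y = X @ B + rng.standard_normal((n, p))

stats = koo_statistics(X, Y)
selected = koo_select(X, Y, threshold=2 * p)  # AIC-style cutoff, an assumption
print(stats.round(2), selected)
```

Each predictor is assessed with one extra fit, so the procedure needs only $k + 1$ regressions rather than a search over all $2^k$ subsets, which is the scalability advantage the abstract refers to. With a fixed cutoff such as $2p$, a null predictor can still exceed the threshold by chance.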