For sparse high-dimensional regression problems, Cox and Battey [1, 9] emphasised the need for confidence sets of models: an enumeration of those small sets of variables that fit the data equivalently well in a suitable statistical sense. This is to be contrasted with the single model returned by penalised regression procedures, which is effective for prediction but potentially misleading for subject-matter understanding. The proposed construction of such sets relied on a preliminary reduction of the full set of variables, and while various possibilities could be considered for this, [9] proposed a succession of regression fits based on incomplete block designs. The purpose of the present paper is to provide insight into both aspects of that work. For an unspecified reduction strategy, we begin by characterising models that are likely to be retained in the model confidence set, emphasising geometric aspects. We then evaluate possible reduction schemes based on penalised regression or marginal screening, before theoretically elucidating the reduction of [9]. We identify features of the covariate matrix that may reduce its efficacy, and indicate improvements to the original proposal. An advantage of the approach is its ability to reveal its own stability or fragility for the data at hand.