For sparse high-dimensional regression problems, Cox and Battey [1, 9] emphasised the need for confidence sets of models: an enumeration of those small sets of variables that fit the data equivalently well in a suitable statistical sense. This is to be contrasted with the single model returned by penalised regression procedures, which is effective for prediction but potentially misleading for subject-matter understanding. The proposed construction of such sets relied on a preliminary reduction of the full set of variables. While various reduction strategies could be considered, [9] proposed a succession of regression fits based on incomplete block designs. The purpose of the present paper is to provide insight into both aspects of that work. For an unspecified reduction strategy, we begin by characterising models that are likely to be retained in the model confidence set, emphasising geometric aspects. We then evaluate possible reduction schemes based on penalised regression or marginal screening, before theoretically elucidating the reduction of [9]. We identify features of the covariate matrix that may reduce the efficacy of that reduction, and indicate improvements to the original proposal. An advantage of the approach is its ability to reveal its own stability or fragility for the data at hand.