Sparsity in a regression context makes the model itself an object of interest, pointing to a confidence set of models as the appropriate presentation of evidence. A difficulty in areas such as genomics, where the number of candidate variables is vast, arises from the need for preliminary reduction prior to the assessment of models. The present paper considers a resolution using inferential separations fundamental to the Fisherian approach to conditional inference, namely, the sufficiency/co-sufficiency separation, and the ancillary/co-ancillary separation. The advantage of these separations is that no direction for departure from any hypothesised model is needed, avoiding issues that would otherwise arise from using the same data for reduction and for model assessment. In idealised cases with no nuisance parameters, the separations extract all the information in the data solely for the purpose for which it is useful, without loss or redundancy. The extent to which estimation of nuisance parameters affects the idealised information extraction is illustrated in detail for the normal-theory linear regression model, extending immediately to a log-normal accelerated-life model for time-to-event outcomes. This idealised analysis provides insight into when sample-splitting is likely to perform as well as, or better than, the co-sufficient or ancillary tests, and when it may be unreliable. The considerations involved in extending the detailed implementation to canonical exponential-family and more general regression models are briefly discussed. As part of the analysis for the Gaussian model, we introduce a modified version of the refitted cross-validation estimator of Fan et al. (2012), whose distribution theory is tractable in the appropriate conditional sense.
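For orientation, the original refitted cross-validation idea of Fan et al. (2012) can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the modified estimator introduced in this paper. The data are split into two halves, variables are screened on one half, the error variance is estimated by least squares on the other half using only the screened variables, and the two resulting estimates are averaged. The function names `screen_top_k`, `refit_variance` and `rcv_variance`, the marginal-correlation screening rule, and the tuning parameter `k` are all hypothetical choices made for this example.

```python
import numpy as np


def screen_top_k(X, y, k):
    """Hypothetical screening rule (an assumption of this sketch):
    keep the k columns with the largest absolute marginal correlation with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Proportional to |sample correlation|; the constant ||yc|| does not affect ranking.
    score = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) + 1e-12)
    return np.sort(np.argsort(score)[-k:])


def refit_variance(X, y, selected):
    """Residual variance from least squares of y on an intercept plus the selected columns.
    Assumes the refitted design has full column rank."""
    n = len(y)
    Z = np.column_stack([np.ones(n), X[:, selected]])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    return resid @ resid / (n - Z.shape[1])


def rcv_variance(X, y, k, rng):
    """Refitted cross-validation estimate of the error variance:
    screen on one half, refit on the other, then swap roles and average."""
    n = len(y)
    perm = rng.permutation(n)
    a, b = perm[: n // 2], perm[n // 2:]
    sel_a = screen_top_k(X[a], y[a], k)  # variables chosen using half a only
    sel_b = screen_top_k(X[b], y[b], k)  # variables chosen using half b only
    var_b = refit_variance(X[b], y[b], sel_a)  # refit on the half not used for selection
    var_a = refit_variance(X[a], y[a], sel_b)
    return 0.5 * (var_a + var_b)


if __name__ == "__main__":
    # Simulated sparse Gaussian regression with true error variance 1.
    rng = np.random.default_rng(0)
    n, p = 400, 200
    X = rng.standard_normal((n, p))
    y = 2.0 * X[:, 0] + 2.0 * X[:, 1] + rng.standard_normal(n)
    print("RCV variance estimate:", rcv_variance(X, y, k=5, rng=rng))
```

Because selection and refitting never use the same observations, spurious variables picked up in the screening step do not deflate the residual variance in the refit. This is the same concern about reusing data for reduction and for assessment that the abstract raises.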