Post-selection inference (PoSI) is a statistical technique for obtaining valid confidence intervals and p-values when hypothesis generation and testing use the same data. PoSI applies to a range of popular algorithms, including the Lasso. Data carving is a variant of PoSI in which a portion of held-out data is combined with the hypothesis-generating data at inference time. While data carving has attractive theoretical and empirical properties, existing approaches rely on computationally expensive MCMC methods to carry out inference. This paper's key contribution is to show that pivotal quantities can be constructed for the data carving procedure based on a known parametric distribution. Specifically, when the selection event is characterized by a set of polyhedral constraints on a Gaussian response, the data carving statistic follows the distribution of the sum of a normal and a truncated normal (SNTN), which is a variant of the truncated bivariate normal distribution. The main impact of this insight is that exact inference for data carving becomes computationally trivial, since the CDF of the SNTN distribution can be computed from the CDF of a standard bivariate normal. A Python package, sntn, has been released to facilitate the adoption of data carving with PoSI.
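To illustrate why the SNTN CDF reduces to bivariate normal CDFs, here is a minimal sketch that does not use the sntn package or its API. It assumes Z = X1 + X2 with X1 ~ N(mu1, tau1^2) independent of X2, a N(mu2, tau2^2) variable truncated to a finite interval [a, b]. Writing W = X1 + X2* for the untruncated X2*, the pair (W, X2*) is bivariate normal with correlation rho = tau2 / sqrt(tau1^2 + tau2^2), so P(Z <= z) = [Phi2(m, beta; rho) - Phi2(m, alpha; rho)] / [Phi(beta) - Phi(alpha)], where m, alpha, and beta are the standardized values of z, a, and b. All function and parameter names below are illustrative. The bivariate normal CDF is evaluated here with Plackett's identity and Simpson's rule so the sketch needs only the standard library; a library routine such as SciPy's `multivariate_normal.cdf` could be substituted.

```python
import math


def norm_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def norm_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def simpson(f, lo, hi, n=2000):
    """Composite Simpson's rule on [lo, hi] with an even number n of panels."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(lo + i * h)
    return total * h / 3.0


def bvn_cdf(h, k, rho):
    """P(U <= h, V <= k) for a standard bivariate normal with correlation rho.

    Uses Plackett's identity: the derivative of the CDF with respect to rho
    is the bivariate density, and the CDF at rho = 0 is Phi(h) * Phi(k).
    Assumes 0 <= rho < 1 (true for the SNTN construction below).
    """
    def density_in_r(r):
        s = 1.0 - r * r
        q = h * h - 2.0 * r * h * k + k * k
        return math.exp(-q / (2.0 * s)) / math.sqrt(s)

    return norm_cdf(h) * norm_cdf(k) + simpson(density_in_r, 0.0, rho) / (2.0 * math.pi)


def sntn_cdf(z, mu1, tau1, mu2, tau2, a, b):
    """CDF of X1 + X2 via two bivariate normal CDF evaluations (hypothetical helper)."""
    s = math.sqrt(tau1 ** 2 + tau2 ** 2)
    rho = tau2 / s                      # corr(X1 + X2*, X2*)
    m = (z - mu1 - mu2) / s             # standardized z
    alpha = (a - mu2) / tau2            # standardized lower truncation point
    beta = (b - mu2) / tau2             # standardized upper truncation point
    num = bvn_cdf(m, beta, rho) - bvn_cdf(m, alpha, rho)
    return num / (norm_cdf(beta) - norm_cdf(alpha))


def sntn_cdf_direct(z, mu1, tau1, mu2, tau2, a, b):
    """Brute-force check: integrate P(X1 <= z - x) against the truncated density of X2."""
    alpha = (a - mu2) / tau2
    beta = (b - mu2) / tau2
    mass = norm_cdf(beta) - norm_cdf(alpha)

    def integrand(x):
        return norm_cdf((z - x - mu1) / tau1) * norm_pdf((x - mu2) / tau2) / tau2

    return simpson(integrand, a, b) / mass


if __name__ == "__main__":
    params = (0.0, 1.0, 0.5, 2.0, -1.0, 3.0)  # mu1, tau1, mu2, tau2, a, b (arbitrary)
    for z in (-2.0, 0.0, 1.5, 4.0):
        print(z, sntn_cdf(z, *params), sntn_cdf_direct(z, *params))
```

The two functions should agree up to quadrature error. A p-value or confidence interval would then follow by evaluating or inverting `sntn_cdf` in the mean parameter, which this sketch leaves out.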