When analyzing large datasets, it is common to select a model before making inferences. For those inferences to be reliable, they must be adjusted to account for the model selection process, which yields selective inferences. Our paper introduces an asymptotic pivot for inference on the effects of selected variables on conditional quantile functions. Built from smoothed quantile regression estimators, the proposed pivot is easy to compute and ensures asymptotically exact selective inferences without strict distributional assumptions on the response variable. At the core of the pivot is external randomization, which lets us use the full sample for both selection and inference, with no need to partition the data into independent subsets or discard data at either step. On simulated data, we find that: (i) the asymptotic confidence intervals based on our pivot achieve the desired coverage rates, even in cases where sample splitting fails due to insufficient sample size for inference; (ii) our intervals are consistently shorter than those produced by sample splitting across various models and signal settings. We report similar findings when we apply our approach to study risk factors for low birth weight in a publicly accessible dataset of US birth records from 2022.