We consider inference for M-estimators after model selection using a sparsity-inducing penalty. While existing methods for this task require bespoke inference procedures, we propose a simpler approach that relies on two insights: (i) adding and subtracting carefully constructed noise to a Gaussian random variable with unknown mean and known variance yields two \emph{independent} Gaussian random variables; and (ii) both the selection event resulting from penalized M-estimation and the event that a standard (non-selective) confidence interval for an M-estimator covers its target can be characterized in terms of an approximately normal ``score variable''. We combine these insights to show that, when the noise is chosen carefully, the model selected using a noisy penalized M-estimator is asymptotically independent of the event that a standard (non-selective) confidence interval on noisy data covers the selected parameter. Therefore, selecting a model via penalized M-estimation (e.g., \verb=glmnet= in \verb=R=) on noisy data, and then conducting \emph{standard} inference on the selected model (e.g., \verb=glm= in \verb=R=) using noisy data, yields valid inference: \emph{no bespoke methods are required}. Our results require independent observations but impose only weak distributional assumptions. We apply the proposed approach to conduct inference on the association between sex and smoking in a social network.
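Insight (i) can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's procedure: it takes a single draw $X \sim N(\mu, \sigma^2)$ with known $\sigma$, draws independent noise $W \sim N(0, \sigma^2)$, and forms $X + \gamma W$ and $X - W/\gamma$. The scaling parameter `gamma`, the function names, and the numerical values are all hypothetical choices made for this example. Because $\mathrm{Cov}(X + \gamma W,\, X - W/\gamma) = \sigma^2 - \sigma^2 = 0$ and the pair is jointly Gaussian, the two noisy copies are independent, so the sample correlation should be close to zero.

```python
import math
import random


def thin_gaussian(x, sigma, gamma, rng):
    """Split one draw x ~ N(mu, sigma^2) into two noisy copies.

    Assumes sigma is known. The noise w ~ N(0, sigma^2) is drawn
    independently of x; gamma > 0 is a hypothetical tuning parameter
    introduced for this illustration.
    """
    w = rng.gauss(0.0, sigma)
    x_select = x + gamma * w   # distributed N(mu, (1 + gamma^2) * sigma^2)
    x_infer = x - w / gamma    # distributed N(mu, (1 + 1/gamma^2) * sigma^2)
    return x_select, x_infer


def sample_correlation(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


rng = random.Random(0)
mu, sigma, gamma = 2.0, 1.5, 1.0   # illustrative values, not from the paper
pairs = [
    thin_gaussian(rng.gauss(mu, sigma), sigma, gamma, rng)
    for _ in range(50_000)
]
selects, infers = zip(*pairs)
corr = sample_correlation(selects, infers)
print(f"sample correlation between the two noisy copies: {corr:.4f}")
```

In the workflow the abstract describes, the first copy would play the role of the noisy data used for selection and the second the noisy data used for standard inference. Both copies keep the mean $\mu$, and in this parameterization a larger `gamma` puts more noise into the selection copy and less into the inference copy.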