This paper presents a novel method for statistical inference in high-dimensional binary models with unspecified structure, using a (potentially misspecified) sparsity-constrained working generalized linear model (GLM) to facilitate inference. Our method is based on the repro samples framework, which generates artificial samples that mimic the actual data-generating process. Our inference targets are the model support, the case probabilities, and the oracle regression coefficients defined in the working GLM. The proposed method has three major advantages. First, it is model-free: it does not rely on specific model assumptions such as logistic or probit regression, nor does it require sparsity of the underlying model. Second, for the model support, we construct a model candidate set for the most influential covariates that achieves guaranteed coverage under a weak signal strength assumption. Third, for the oracle regression coefficients, we establish confidence sets for any group of linear combinations of the coefficients. Simulation results show that the proposed method produces valid and small model candidate sets. It also achieves better coverage for regression coefficients than state-of-the-art debiasing methods when the working model is the actual model that generates the sample data. Additionally, we analyze single-cell RNA-seq data on the immune response. Besides identifying genes previously reported as relevant in the literature, our method also discovers a significant gene that has not been studied before, revealing a potential new direction in understanding cellular immune response mechanisms.
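The idea of generating artificial samples that mimic the data-generating process can be illustrated with a minimal sketch. It assumes a logistic working model and represents each binary response as Y_i = 1{U_i <= expit(x_i' theta)} with U_i uniform on (0, 1); the function names, dimensions, and coefficient values are hypothetical choices for illustration, and this is not the paper's inference procedure, only the sample-generation step under those assumptions.

```python
import numpy as np


def expit(z):
    """Logistic function, mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + np.exp(-z))


def generate_binary(X, theta, u):
    """Binary responses from noise u: Y_i = 1{u_i <= expit(x_i' theta)}."""
    return (u <= expit(X @ theta)).astype(int)


def repro_samples(X, theta, n_copies, rng):
    """Artificial response vectors for a candidate theta, using fresh noise.

    Returns an array of shape (n_copies, n); each row is one artificial sample.
    """
    U = rng.uniform(size=(n_copies, X.shape[0]))
    return (U <= expit(X @ theta)).astype(int)


rng = np.random.default_rng(0)
n, p = 200, 50                          # hypothetical sample size and dimension
X = rng.normal(size=(n, p))
theta_true = np.zeros(p)
theta_true[:3] = [2.0, -1.5, 1.0]       # hypothetical sparse coefficients

u_obs = rng.uniform(size=n)             # noise behind the data (unobserved in practice)
y_obs = generate_binary(X, theta_true, u_obs)

# Artificial samples generated at a candidate parameter value.
Y_star = repro_samples(X, theta_true, n_copies=500, rng=rng)
```

Each row of `Y_star` is one artificial data set produced with the same covariates and a fresh noise draw. Comparing a statistic of `y_obs` against its distribution over such rows, across candidate parameter values, is the kind of comparison a simulation-based inference scheme can build on.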