Repro Samples Method for Model-Free Inference in High-Dimensional Binary Classification

This paper presents a novel method for statistical inference in high-dimensional binary models with unspecified structure, where we leverage a (potentially misspecified) sparsity-constrained working generalized linear model (GLM) to facilitate the inference process. Our method is based on the repro samples framework, which generates artificial samples that mimic the actual data-generating process. Our inference targets include the model support, case probabilities, and the oracle regression coefficients defined in the working GLM. The proposed method has three major advantages. First, this approach is model-free, that is, it does not rely on specific model assumptions such as logistic or probit regression, nor does it require sparsity assumptions on the underlying model. Second, for model support, we construct a model candidate set for the most influential covariates that achieves guaranteed coverage under a weak signal strength assumption. Third, for oracle regression coefficients, we establish confidence sets for any group of linear combinations of regression coefficients. Simulation results demonstrate that the proposed method produces valid and small model candidate sets. It also achieves better coverage for regression coefficients than the state-of-the-art debiasing methods when the working model is the actual model that generates the sample data. Additionally, we analyze single-cell RNA-seq data on the immune response. Besides identifying genes previously proven as relevant in the literature, our method also discovers a significant gene that has not been studied before, revealing a potential new direction in understanding cellular immune response mechanisms.
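As a rough illustration of what "artificial samples that mimic the data-generating process" can mean for a binary response, the sketch below draws artificial labels under a working logistic GLM written in latent-variable form. This is not the paper's algorithm: the function name `generate_repro_sample`, the choice of a logistic link, the dimensions, and the candidate coefficient vector are all assumptions made for illustration only.

```python
import numpy as np


def generate_repro_sample(X, beta, rng):
    """Draw one artificial binary response vector under a working logistic GLM.

    Hypothetical helper (not from the paper). It uses the latent-variable form
    Y* = 1{X beta + eps* > 0} with eps* ~ Logistic(0, 1), which implies
    P(Y* = 1 | X) = 1 / (1 + exp(-X beta)).
    """
    eps_star = rng.logistic(loc=0.0, scale=1.0, size=X.shape[0])
    return (X @ beta + eps_star > 0).astype(int)


# Assumed toy setting: more covariates than observations, sparse candidate beta.
rng = np.random.default_rng(0)
n, p = 200, 500
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]  # illustrative values, not estimates from any data

y_star = generate_repro_sample(X, beta, rng)
```

Here the covariates `X` are held fixed and only the noise is redrawn, so repeated calls give artificial responses that can be compared with the observed ones for a given candidate `beta`. How the method uses such samples to build model candidate sets and confidence sets is not specified in the abstract and is not attempted in this sketch.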
