Genomic studies routinely perform tens of thousands of simultaneous hypothesis tests to identify differentially expressed genes. However, unmeasured confounders can substantially bias many standard statistical approaches. This paper investigates the large-scale hypothesis testing problem for multivariate generalized linear models in the presence of confounding effects. Under arbitrary confounding mechanisms, we propose a unified statistical estimation and inference framework that harnesses orthogonal structures and integrates linear projections into three key stages. First, it leverages multivariate responses to separate marginal and uncorrelated confounding effects, recovering the column space of the confounding coefficients. Second, latent factors and primary effects are jointly estimated, using $\ell_1$-regularization for sparsity while imposing orthogonality on the confounding coefficients. Finally, we incorporate projected and weighted bias-correction steps for hypothesis testing. Theoretically, we establish identification conditions for the various effects and non-asymptotic error bounds. We show that the asymptotic $z$-tests effectively control the Type-I error as the sample size and the number of responses approach infinity. Numerical experiments demonstrate that the proposed method controls the false discovery rate via the Benjamini-Hochberg procedure and is more powerful than alternative methods. By comparing single-cell RNA-seq counts from two groups of samples, we demonstrate the suitability of adjusting for confounding effects when significant covariates are absent from the model.
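The last step named in the abstract, turning per-gene $z$-statistics into discoveries with the Benjamini-Hochberg procedure, can be illustrated with a minimal sketch. It does not reproduce the paper's estimation or bias-correction stages. It assumes the $z$-statistics are already available and approximately standard normal under the null, and the function names and input values are hypothetical, chosen only for illustration.

```python
import math


def two_sided_p_value(z):
    """Two-sided p-value of a z-statistic under a standard normal null."""
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Standard step-up rule: find the largest rank k such that the k-th
    smallest p-value is at most alpha * k / m, then reject the k
    hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            k = rank  # keep the largest rank that satisfies the condition
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject


# Hypothetical z-statistics for six genes (illustrative values only).
z_stats = [0.3, -4.1, 1.2, 3.5, -0.8, 2.9]
p_vals = [two_sided_p_value(z) for z in z_stats]
rejected = benjamini_hochberg(p_vals, alpha=0.05)
print(rejected)
```

With these illustrative inputs, the three genes with the largest absolute $z$-statistics are flagged. In practice, the quality of the false discovery rate control depends on how well the upstream $z$-statistics are calibrated, which is what the confounder adjustment in the abstract is meant to address.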