We consider the problem of estimating fold-changes in the expected value of a multivariate outcome that is observed subject to unknown sample-specific and category-specific perturbations. We are motivated by high-throughput sequencing studies of the abundance of microbial taxa, in which microbes are systematically over- and under-detected relative to their true abundances. Our log-linear model admits a partially identifiable estimand, and we establish full identifiability by imposing interpretable parameter constraints. To reduce bias and guarantee the existence of parameter estimates in the presence of sparse observations, we apply an asymptotically negligible and constraint-invariant penalty to our estimating function. We develop a fast coordinate descent algorithm for estimation, and an augmented Lagrangian algorithm for estimation under null hypotheses. We construct a model-robust score test, and demonstrate valid inference even for small sample sizes and violated distributional assumptions. The flexibility of the approach and comparisons to related methods are illustrated via a meta-analysis of microbial associations with colorectal cancer.
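To illustrate what partial identifiability means here, the following is a minimal numerical sketch, not the paper's implementation. It assumes a log-linear mean of the form log E[Y_ij] = z_i + beta0_j + x_i * beta1_j with a single covariate. That form, the names `z`, `beta0`, `beta1`, and `log_mean`, and the median-centering constraint are all illustrative assumptions that the abstract does not specify. Under this assumed form, adding a constant to every category's log fold-change can be absorbed by the sample-specific effects, so the data cannot distinguish the two parameter sets until a constraint is imposed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, J = 5, 4                     # hypothetical numbers of samples and categories
x = rng.normal(size=n)          # one covariate value per sample
z = rng.normal(size=n)          # unknown sample-specific effects (assumed form)
beta0 = rng.normal(size=J)      # category-specific intercepts (assumed form)
beta1 = rng.normal(size=J)      # category-specific log fold-changes


def log_mean(z, beta0, beta1, x):
    """Log expected outcome under the assumed model, returned as an n x J array."""
    return z[:, None] + beta0[None, :] + np.outer(x, beta1)


# Shift every log fold-change by c and compensate in the sample effects.
c = 2.5
mu_original = log_mean(z, beta0, beta1, x)
mu_shifted = log_mean(z - c * x, beta0, beta1 + c, x)
same_mean = np.allclose(mu_original, mu_shifted)


def center(b):
    """Illustrative constraint: force the median log fold-change to zero."""
    return b - np.median(b)


# The constrained parameters agree even though the raw parameters differ.
same_constrained = np.allclose(center(beta1), center(beta1 + c))
raw_differs = not np.allclose(beta1, beta1 + c)
```

Both parameter sets give identical means, so only differences between category-level log fold-changes are determined by the data. Fixing a reference value, here the median, selects one representative from each equivalence class.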