Compositional data arise in many areas of research in the natural and biomedical sciences. One prominent example is in the study of the human gut microbiome, where one can measure the relative abundance of many distinct microorganisms in a subject's gut. Often, practitioners are interested in learning how the dependencies between microbes vary across distinct populations or experimental conditions. In statistical terms, the goal is to estimate a covariance matrix for the (latent) log-abundances of the microbes in each of the populations. However, the compositional nature of the data prevents the use of standard estimators for these covariance matrices. In this article, we propose an estimator of multiple covariance matrices which allows for information sharing across distinct populations of samples. Compared to some existing estimators, which estimate the covariance matrices of interest indirectly, our estimator is direct, ensures positive definiteness, and is the solution to a convex optimization problem. We compute our estimator using a proximal-proximal gradient descent algorithm. Asymptotic properties of our estimator reveal that it can perform well in high-dimensional settings. Through simulation studies, we demonstrate that our estimator can outperform existing estimators. We show that our method provides more reliable estimates than competitors in an analysis of microbiome data from subjects with chronic fatigue syndrome.
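To illustrate why the compositional nature of the data prevents the use of standard covariance estimators, the following is a minimal simulation sketch. It is not the estimator proposed in this article. The AR(1)-style latent covariance `omega`, the sample size, and the helper names `clr` and `simulate` are all assumptions chosen for illustration. The sketch draws latent log-abundances with a known covariance, converts them to relative abundances, and then computes the sample covariance of the centered log-ratio (clr) transformed compositions, which is one common stand-in for the unobservable log-abundances.

```python
import numpy as np


def clr(compositions):
    """Centered log-ratio transform, applied to each row (sample)."""
    log_x = np.log(compositions)
    return log_x - log_x.mean(axis=1, keepdims=True)


def simulate(n=5000, p=5, rho=0.6, seed=0):
    """Draw compositions from latent Gaussian log-abundances.

    The AR(1)-style covariance below is a hypothetical choice made
    only for this illustration.
    """
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    omega = rho ** np.abs(idx[:, None] - idx[None, :])
    log_abundance = rng.multivariate_normal(np.zeros(p), omega, size=n)
    abundance = np.exp(log_abundance)
    # Only relative abundances are observed: each row sums to one.
    compositions = abundance / abundance.sum(axis=1, keepdims=True)
    return omega, compositions


omega, compositions = simulate()
p = omega.shape[0]

# Sample covariance of the clr-transformed compositions.
clr_cov = np.cov(clr(compositions), rowvar=False)

# Centering matrix: the clr transform equals the latent log-abundances
# multiplied by G, so clr_cov estimates G @ omega @ G rather than omega.
G = np.eye(p) - np.ones((p, p)) / p

print("largest |clr_cov - omega|:        ", np.abs(clr_cov - omega).max())
print("largest |clr_cov - G @ omega @ G|:", np.abs(clr_cov - G @ omega @ G).max())
print("smallest eigenvalue of clr_cov:   ", np.linalg.eigvalsh(clr_cov).min())
```

Under these assumptions, the clr covariance tracks the centered matrix `G @ omega @ G` rather than `omega` itself, and its rows sum to zero, so it is singular regardless of the sample size. This is the sense in which a standard sample covariance computed from compositions does not directly estimate the covariance of the latent log-abundances.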