High-dimensional compositional data are prevalent in many applications. The simplex constraint poses intrinsic challenges to inferring the conditional dependence relationships among the components forming a composition, as encoded by a large precision matrix. We introduce a precise specification of the compositional precision matrix and relate it to its basis counterpart, which is shown to be asymptotically identifiable under suitable sparsity assumptions. By exploiting this connection, we propose a composition adaptive regularized estimation (CARE) method for estimating the sparse basis precision matrix. We derive rates of convergence for the estimator and provide theoretical guarantees on support recovery and data-driven parameter tuning. Our theory reveals an intriguing trade-off between identification and estimation, thereby highlighting the blessing of dimensionality in compositional data analysis. In particular, in sufficiently high dimensions, the CARE estimator achieves minimax optimality and performs as well as if the basis were observed. We further discuss how our framework can be extended to handle data containing zeros, including sampling zeros and structural zeros. The advantages of CARE over existing methods are illustrated by simulation studies and an application to inferring microbial ecological networks in the human gut.
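The link between the compositional and basis precision matrices can be illustrated numerically. The sketch below is not the CARE estimator. It assumes, as a working definition that the abstract does not spell out, that the compositional precision matrix is the Moore-Penrose pseudoinverse of the covariance of the centered log-ratio (clr) transform. Because the clr transform of a composition equals the centered log-basis, that covariance is G Σ G with G = I - 11ᵀ/p, and its pseudoinverse equals Ω - Ω11ᵀΩ / (1ᵀΩ1), where Ω = Σ⁻¹ is the basis precision matrix. The tridiagonal Ω and all function names are hypothetical choices for illustration.

```python
import numpy as np


def tridiagonal_precision(p, diag=2.0, off=0.5):
    """Toy sparse, positive-definite basis precision matrix (illustrative only)."""
    omega = diag * np.eye(p)
    idx = np.arange(p - 1)
    omega[idx, idx + 1] = off
    omega[idx + 1, idx] = off
    return omega


def clr_precision(omega):
    """Pseudoinverse of the clr covariance G @ Sigma @ G implied by omega."""
    p = omega.shape[0]
    sigma = np.linalg.inv(omega)            # basis covariance
    g = np.eye(p) - np.ones((p, p)) / p     # centering matrix
    sigma_c = g @ sigma @ g                 # singular: G maps the ones vector to 0
    return np.linalg.pinv(sigma_c)


def rank_one_correction(omega):
    """Omega 1 1^T Omega / (1^T Omega 1), the gap between the two matrices."""
    v = omega.sum(axis=1)                   # Omega @ 1
    return np.outer(v, v) / v.sum()


def max_gap(p):
    """Largest entrywise difference between clr and basis precision matrices."""
    omega = tridiagonal_precision(p)
    return np.abs(clr_precision(omega) - omega).max()


if __name__ == "__main__":
    for p in (20, 200, 1000):
        print(p, max_gap(p))
```

In this toy design the row sums of Ω stay bounded while 1ᵀΩ1 grows linearly in p, so the entrywise gap shrinks roughly like 1/p. This is one way to see why a sparse basis precision matrix becomes easier to identify from compositional data as the dimension grows.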