High-dimensional compositional data are prevalent in many applications. The simplex constraint poses intrinsic challenges to inferring the conditional dependence relationships among the components forming a composition, as encoded by a large precision matrix. We introduce a precise specification of the compositional precision matrix and relate it to its basis counterpart, which is shown to be asymptotically identifiable under suitable sparsity assumptions. By exploiting this connection, we propose a composition adaptive regularized estimation (CARE) method for estimating the sparse basis precision matrix. We derive rates of convergence for the estimator and provide theoretical guarantees on support recovery and data-driven parameter tuning. Our theory reveals an intriguing trade-off between identification and estimation, thereby highlighting the blessing of dimensionality in compositional data analysis. In particular, in sufficiently high dimensions, the CARE estimator achieves minimax optimality and performs as well as if the basis were observed. We further discuss how our framework can be extended to handle data containing zeros, including sampling zeros and structural zeros. The advantages of CARE over existing methods are illustrated by simulation studies and an application to inferring microbial ecological networks in the human gut.
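To make the challenge posed by the simplex constraint concrete, the following is a minimal sketch and not the CARE method itself. It assumes a log-normal basis with a hypothetical tridiagonal precision matrix `omega`, and it uses the centered log-ratio (clr) transform as one common way to work with compositions; neither choice is specified in the abstract. The sketch shows that the covariance of the clr-transformed composition is singular, so a precision matrix cannot be obtained from the observed composition by ordinary inversion.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 6  # sample size and number of components (illustrative values)

# Hypothetical sparse basis precision matrix: tridiagonal and
# diagonally dominant, hence positive definite. An assumption for illustration.
omega = np.eye(p) * 2.0
for j in range(p - 1):
    omega[j, j + 1] = omega[j + 1, j] = 0.6

# Assumed model: the log of the unobserved basis is multivariate normal
# with covariance equal to the inverse of omega.
sigma = np.linalg.inv(omega)
log_basis = rng.multivariate_normal(np.zeros(p), sigma, size=n)
basis = np.exp(log_basis)

# Only the composition is observed: each row is normalized to sum to one.
composition = basis / basis.sum(axis=1, keepdims=True)

# Centered log-ratio transform: subtract each row's mean log value.
log_comp = np.log(composition)
clr = log_comp - log_comp.mean(axis=1, keepdims=True)

# Each clr row sums to zero, so the sample covariance has the all-ones
# vector in its null space and is not invertible.
clr_cov = np.cov(clr, rowvar=False)
row_sums = clr_cov.sum(axis=1)
smallest_eig = np.linalg.eigvalsh(clr_cov)[0]

print("max |row sum| of clr covariance:", np.abs(row_sums).max())
print("smallest eigenvalue of clr covariance:", smallest_eig)
```

Both printed quantities are zero up to floating-point error for any `n` and `p`, because the singularity comes from the constraint itself and not from sampling noise.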