In real-world applications involving compositional datasets, structural zeros are common. They arise when, due to physical limitations, one or more variables are intrinsically zero for a subset of the population under study. The classical Aitchison approach requires all components of a composition to be strictly positive, because adapting the most widely used statistical techniques to the compositional framework relies on computing logratios of these components. Datasets containing structural zeros are therefore usually split into two subsets: one containing the observations with structural zeros and one containing all the other data. Statistical analysis is then performed on the two subsets separately, under the assumption that they are drawn from two different subpopulations. However, this approach may lead to incomplete results when the split into two populations is merely artificial. To overcome this limitation and increase the robustness of such an approach, we introduce a statistical test to check whether the first K principal components of the two datasets span the same vector space. An approximation of the corresponding null distribution is derived analytically when the data are normally distributed on the simplex, and through a nonparametric bootstrap approach otherwise. Results from simulated data show that the proposed procedure can discriminate scenarios in which the subpopulations share a common subspace from those in which they are actually distinct. The performance of the proposed method is also assessed on an experimental dataset of microbiome measurements.
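To make the idea of comparing the subspaces spanned by the first K principal components concrete, the following is a minimal sketch and not the test statistic or null approximation proposed here. It assumes that both subsets have already been expressed on the same set of strictly positive parts, uses the centered logratio (clr) transform, measures the discrepancy between the two subspaces through principal angles, and calibrates it with a simple pooled bootstrap. All function names (`clr`, `top_k_directions`, `subspace_distance`, `bootstrap_pvalue`) are hypothetical, and the pooled resampling imposes a stronger null (same distribution up to location) than equality of subspaces alone.

```python
import numpy as np


def clr(X):
    """Centered logratio transform of strictly positive compositions (one per row)."""
    logX = np.log(X)
    return logX - logX.mean(axis=1, keepdims=True)


def top_k_directions(Z, k):
    """Orthonormal basis (as columns) of the first k principal directions of Z."""
    Zc = Z - Z.mean(axis=0)
    _, _, vt = np.linalg.svd(Zc, full_matrices=False)
    return vt[:k].T


def subspace_distance(A, B):
    """Sum of squared sines of the principal angles between span(A) and span(B).

    A and B must have orthonormal columns; the singular values of A.T @ B
    are the cosines of the principal angles.
    """
    cosines = np.linalg.svd(A.T @ B, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)
    return float(np.sum(1.0 - cosines**2))


def bootstrap_pvalue(Z1, Z2, k, n_boot=500, seed=0):
    """Illustrative pooled-bootstrap p-value (an assumption, not the paper's procedure)."""
    rng = np.random.default_rng(seed)
    observed = subspace_distance(top_k_directions(Z1, k), top_k_directions(Z2, k))

    # Center each group, then pool, so that resampled groups share one distribution.
    pooled = np.vstack([Z1 - Z1.mean(axis=0), Z2 - Z2.mean(axis=0)])
    n1, n2 = len(Z1), len(Z2)

    exceed = 0
    for _ in range(n_boot):
        S1 = pooled[rng.integers(0, len(pooled), size=n1)]
        S2 = pooled[rng.integers(0, len(pooled), size=n2)]
        d = subspace_distance(top_k_directions(S1, k), top_k_directions(S2, k))
        exceed += d >= observed

    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (n_boot + 1)
```

With `Z1 = clr(X1)` and `Z2 = clr(X2)` for two composition matrices on the same parts, `bootstrap_pvalue(Z1, Z2, k=2)` returns a value in (0, 1]; small values suggest that the leading subspaces differ. The distance is 0 when the two subspaces coincide and equals k when they are mutually orthogonal.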