Existing statistical methods for compositional data analysis are inadequate for many modern applications for two reasons. First, modern compositional datasets, for example in microbiome research, display traits such as high dimensionality and sparsity that are poorly modeled with traditional approaches. Second, assessing -- in an unbiased way -- how summary statistics of a composition (e.g., racial diversity) affect a response variable is not straightforward. In this work, we propose a framework based on hypothetical data perturbations that addresses both issues. Unlike existing methods for compositional data, we do not transform the data and instead use perturbations to define interpretable statistical functionals on the compositions themselves, which we call average perturbation effects. These average perturbation effects, which can be employed in many applications, naturally account for confounding that biases frequently used marginal dependence analyses. We show how average perturbation effects can be estimated efficiently by deriving a perturbation-dependent reparametrization and applying semiparametric estimation techniques. We analyze the proposed estimators empirically on simulated data and demonstrate advantages over existing techniques on US census and microbiome data. For all proposed estimators, we provide confidence intervals with uniform asymptotic coverage guarantees.