Compositional data arise when count observations are normalised into proportions that sum to unity. To allow the use of standard statistical methods, compositional proportions can be mapped from the simplex into Euclidean space through the isometric log-ratio (ilr) transformation. When the counts follow a multinomial distribution with fixed class-specific probabilities, the distribution of the resulting ilr coordinates has been shown to be asymptotically multivariate normal. Here we derive an asymptotic normal approximation to the distribution of the ilr coordinates when the counts are overdispersed under the Dirichlet-multinomial mixture model. In a simulation study, we then assess the practical applicability of the approximation by comparing it with the empirical distribution of the ilr coordinates under varying levels of extra-multinomial variation and varying total counts. The approximation works well, except when the total count is small or the overdispersion is high. These empirical results hold even under population-level heterogeneity in the total count. Our work is motivated by microbiome data, which often exhibit considerable extra-multinomial variation and are increasingly treated as compositional by scaling taxon-specific counts into proportions. We conclude that if the analysis of empirical data relies on normality of the ilr coordinates, it may be advisable to choose a taxonomic level at which counts are less sparse, so that the distribution of taxon-specific class probabilities remains unimodal.
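The setting described above can be illustrated with a minimal simulation sketch. It draws Dirichlet-multinomial counts, scales them into proportions, and computes ilr coordinates with respect to one particular orthonormal (Helmert-type) basis. The basis choice, the Dirichlet parameters `alpha`, the total count `n_total`, and the handling of zero counts are all assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np


def helmert_basis(D):
    """Return a (D-1) x D matrix whose rows are orthonormal and sum to zero.

    This is one possible ilr basis; any orthonormal basis of the
    sum-to-zero hyperplane would serve equally well.
    """
    V = np.zeros((D - 1, D))
    for i in range(1, D):
        V[i - 1, :i] = 1.0 / i
        V[i - 1, i] = -1.0
        V[i - 1] *= np.sqrt(i / (i + 1.0))
    return V


def ilr(p, V):
    """ilr coordinates of strictly positive compositions p (one per row)."""
    logp = np.log(p)
    clr = logp - logp.mean(axis=1, keepdims=True)  # centred log-ratio
    return clr @ V.T


rng = np.random.default_rng(0)
alpha = np.array([20.0, 30.0, 50.0])  # hypothetical Dirichlet parameters
n_total = 1000                        # hypothetical total count per sample
n_samples = 2000

# Dirichlet-multinomial: draw class probabilities, then multinomial counts.
probs = rng.dirichlet(alpha, size=n_samples)
counts = np.array([rng.multinomial(n_total, p) for p in probs])

# Assumption: samples containing a zero count are dropped, because the
# log-ratio is undefined for zero proportions.
counts = counts[(counts > 0).all(axis=1)]
props = counts / counts.sum(axis=1, keepdims=True)

V = helmert_basis(len(alpha))
z = ilr(props, V)

print("empirical mean of ilr coordinates:", z.mean(axis=0))
print("empirical covariance of ilr coordinates:")
print(np.cov(z, rowvar=False))
```

Lowering `n_total` or shrinking the sum of `alpha` (which increases overdispersion) and then inspecting the columns of `z`, for example with histograms or normal Q-Q plots, gives an informal view of the conditions under which normality becomes doubtful.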