We present an application of our recently proposed information-geometric approach to compositional data analysis (CoDA). The application concerns relative count data, such as those obtained from sequencing experiments. We first review in some detail the necessary concepts: basic count distributions and their information-geometric description, the link between Bayesian statistics and shrinkage, and the use of power transformations in CoDA. We then show that powering, i.e., the equivalent of scalar multiplication on the simplex, can be understood as a shrinkage problem on the tangent space of the simplex. In information-geometric terms, traditional shrinkage corresponds to an optimization along a mixture (or m-) geodesic, while powering (or, as we call it, exponential shrinkage) can be optimized along an exponential (or e-) geodesic. While the m-geodesic corresponds to the posterior mean of the multinomial counts using a conjugate prior, the e-geodesic corresponds to an alternative parametrization of the posterior in which prior and data contributions are weighted by geometric rather than arithmetic means. To optimize the exponential shrinkage parameter, we use the mean squared error on the tangent space as a cost function, which is simply the expected squared Aitchison distance from the true parameter. We derive an analytic solution for its minimum based on the delta method and test it via simulations. We also discuss exponential shrinkage as an alternative to zero imputation for dimension reduction and data normalization.
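The contrast between the two kinds of shrinkage described in the abstract can be illustrated with a minimal sketch. It is not the authors' estimator and does not reproduce the analytic optimum of the shrinkage parameter. The function names, the uniform shrinkage target, the example counts, and the values of `alpha` and `lam` are all assumptions chosen for illustration. The sketch only checks three standard identities: m-shrinkage toward a uniform target with a suitable weight equals the posterior mean under a symmetric Dirichlet prior, e-shrinkage toward a uniform target equals powering, and powering rescales the Aitchison distance to the uniform composition.

```python
import math

def closure(x):
    """Rescale positive values so they sum to one (a point on the simplex)."""
    total = sum(x)
    return [v / total for v in x]

def power(p, a):
    """Powering: raise each part to the exponent a, then renormalize."""
    return closure([v ** a for v in p])

def m_shrink(p, target, lam):
    """Mixture (m-) geodesic: arithmetic weighting of estimate and target."""
    return [(1 - lam) * pi + lam * ti for pi, ti in zip(p, target)]

def e_shrink(p, target, lam):
    """Exponential (e-) geodesic: geometric weighting, then renormalize."""
    return closure([pi ** (1 - lam) * ti ** lam for pi, ti in zip(p, target)])

def clr(p):
    """Centered log-ratio transform (requires strictly positive parts)."""
    logs = [math.log(v) for v in p]
    mean = sum(logs) / len(logs)
    return [v - mean for v in logs]

def aitchison_distance(p, q):
    """Euclidean distance between the clr-transformed compositions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(p), clr(q))))

# Hypothetical example data: strictly positive counts in D = 4 categories.
counts = [12, 5, 2, 1]
n, D = sum(counts), len(counts)
p_hat = closure(counts)            # observed relative frequencies
uniform = [1.0 / D] * D            # assumed shrinkage target

# m-shrinkage: with a symmetric Dirichlet(alpha) prior (alpha is an assumption),
# the posterior mean is a mixture of p_hat and the uniform composition.
alpha = 0.5
lam_m = alpha * D / (n + alpha * D)
p_m = m_shrink(p_hat, uniform, lam_m)
posterior_mean = [(c + alpha) / (n + alpha * D) for c in counts]

# e-shrinkage toward the uniform target is powering with exponent 1 - lam.
lam = 0.3                          # arbitrary illustrative value, not optimized
p_e = e_shrink(p_hat, uniform, lam)
p_pow = power(p_hat, 1 - lam)

print("m-shrunk:", p_m)
print("e-shrunk:", p_e)
print("distance to uniform before:", aitchison_distance(p_hat, uniform))
print("distance to uniform after: ", aitchison_distance(p_e, uniform))
```

In this sketch the exponent `1 - lam` plays the role of the exponential shrinkage parameter. The counts are kept strictly positive because the clr transform takes logarithms.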