The analysis of compositional data has been dominated by logratio transformations, which ensure exact subcompositional coherence and, in some situations, exact isometry as well. A problem with this approach is that data zeros, found in most applications, have to be replaced before the logarithm can be taken. An alternative new approach, called the 'chiPower' transformation, allows data zeros: it combines the standardization inherent in the chi-square distance of correspondence analysis with the essential elements of the Box-Cox power transformation. The chiPower transformation is justified because it defines between-sample distances that tend to logratio distances for strictly positive data as the power parameter tends to zero, so that in the limit it is equivalent to transforming to logratios. For data with zeros, a value of the power can be identified that brings the chiPower transformation as close as possible to a logratio transformation, without having to substitute the zeros. Especially for high-dimensional data, this alternative can come so close to coherence and isometry that it is a valid approach to the analysis of compositional data. Furthermore, in a supervised learning context, if the compositional variables serve as predictors of a response in a modelling framework, for example generalized linear models, then the power can be used as a tuning parameter, chosen by cross-validation to optimize the accuracy of prediction. The chiPower-transformed variables have a straightforward interpretation, since each is identified with a single compositional part, not a ratio.
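The limiting behaviour described above can be checked numerically. The abstract does not state the formula, so the sketch below rests on an assumed reading of it: raise the parts to the power, re-close each sample to sum to one, divide each part by the square root of its column mean (the chi-square standardization), and rescale by the reciprocal of the power (the Box-Cox element). The function names `chipower`, `clr` and `pairwise_dist`, the equal sample weights, and the simulated Dirichlet data are all illustrative assumptions, not the author's implementation.

```python
import numpy as np

def chipower(X, lam):
    """Assumed chiPower sketch: power, re-close, chi-square standardize, scale by 1/lam."""
    X = np.asarray(X, dtype=float)
    P = X ** lam                              # power transform; zeros stay zero for lam > 0
    P = P / P.sum(axis=1, keepdims=True)      # re-close each sample to sum to 1
    c = P.mean(axis=0)                        # column means (equal sample weights assumed)
    return P / np.sqrt(c) / lam               # chi-square standardization, Box-Cox scaling

def clr(X):
    """Centred logratio transform; requires strictly positive data."""
    L = np.log(np.asarray(X, dtype=float))
    return L - L.mean(axis=1, keepdims=True)

def pairwise_dist(Z):
    """Euclidean distances between all pairs of rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

# Simulated strictly positive compositions: 20 samples, 6 parts (illustrative data only).
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(6), size=20)
J = X.shape[1]

# With a small power, chiPower distances should approach the logratio distances.
# Under this sketch the limit is the clr distance scaled by 1/sqrt(J).
d_chi = pairwise_dist(chipower(X, 1e-4))
d_lr = pairwise_dist(clr(X)) / np.sqrt(J)
off_diag = d_lr > 0
max_rel_gap = np.max(np.abs(d_chi - d_lr)[off_diag] / d_lr[off_diag])

# With a zero in the data, the transform stays finite and no substitution is needed.
X0 = X.copy()
X0[0, 0] = 0.0
X0 = X0 / X0.sum(axis=1, keepdims=True)
Z0 = chipower(X0, 0.5)
```

In this sketch a part that is zero in every sample would have a zero column mean and would need to be dropped before standardizing. In the supervised setting, the same function could be applied over a grid of powers, with the power giving the best cross-validated prediction accuracy being retained.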