We introduce a novel approach to compositional data analysis based on $L^{\infty}$-normalization, addressing the challenges posed by zero-rich high-throughput data. Traditional methods such as Aitchison's transformations require zeros to be excluded, which conflicts with the fact that omics datasets contain structural zeros that cannot be removed without violating the inherent biological structure. Such datasets lie entirely on the boundary of compositional space, so approaches designed for its interior are fundamentally misaligned with them. We present a family of $L^p$-normalizations, focusing on $L^{\infty}$-normalization because of its advantageous properties. This approach identifies compositional space with the $L^{\infty}$-simplex, represented as a union of top-dimensional faces called $L^{\infty}$-cells. Each cell consists of the samples in which one component's absolute abundance is greater than or equal to that of every other component, and it carries a coordinate system that identifies it with a $d$-dimensional unit cube. When applied to vaginal microbiome data, the $L^{\infty}$-decomposition aligns with established Community State Types (CSTs) while offering several advantages: each $L^{\infty}$-CST is named after its dominating component, has a clear biological meaning, remains stable under changes to the sample set, resolves issues that arise with clustering-based approaches, and provides a coordinate system for exploring its internal structure. We extend homogeneous coordinates through cube embedding, which maps the data into a $d$-dimensional unit cube. These embeddings can be integrated via the Cartesian product, providing a unified representation of the data from multiple perspectives. Although demonstrated here on microbiome studies, these methods apply to any compositional data.
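The normalization and cell assignment described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes that $L^{\infty}$-normalization divides each sample by its largest component, that a sample's $L^{\infty}$-cell is indexed by that largest component, and that the cell coordinates are the remaining normalized components (so a sample with $d+1$ components maps into $[0,1]^d$). The function names `linf_normalize` and `cell_coordinates`, the tie-breaking rule, and the toy counts are all hypothetical.

```python
import numpy as np


def linf_normalize(counts):
    """Divide each sample (row) by its largest component.

    Zeros are left untouched, so no pseudocounts are needed.
    """
    counts = np.asarray(counts, dtype=float)
    peak = counts.max(axis=1, keepdims=True)
    if np.any(peak <= 0):
        raise ValueError("each sample needs at least one positive component")
    return counts / peak


def cell_coordinates(counts):
    """Return (cell index, cube coordinates) for every sample.

    The cell index is the position of the dominating component.
    Assumption: ties are broken by taking the first maximal component.
    The coordinates are the other normalized components, each in [0, 1].
    """
    normalized = linf_normalize(counts)
    n_samples, n_components = normalized.shape
    cells = np.argmax(normalized, axis=1)

    # Drop the dominating component (always equal to 1) from each row.
    keep = np.ones((n_samples, n_components), dtype=bool)
    keep[np.arange(n_samples), cells] = False
    coords = normalized[keep].reshape(n_samples, n_components - 1)
    return cells, coords


# Toy counts for illustration only: 3 samples, 3 components, with zeros.
toy_counts = [[10, 0, 5],
              [0, 8, 2],
              [3, 3, 0]]
cells, coords = cell_coordinates(toy_counts)
print(cells)   # dominating component per sample
print(coords)  # position of each sample inside its unit cube
```

Under these assumptions, samples containing zeros need no special handling: a zero component simply becomes a zero coordinate, placing the sample on a face of the cube.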