Compositional data, where only relative abundances are available, are common in microbiome and other high-throughput sequencing studies. Log ratios between groups of variables serve as key biomarkers in these settings. However, selecting predictive log ratios is a combinatorial problem, and existing methods based on greedy search are computationally expensive, which limits their applicability to high-dimensional data. We introduce the supervised log ratio (SLR) method, a novel and efficient approach for selecting predictive log ratios in high-dimensional settings. SLR first screens for active variables using univariate regression on log-ratio-transformed data and then applies principal balance analysis to the screened variables to define balance biomarkers. The approach thus leverages both the relationship between the response and the predictors and the correlations among the predictors to improve accuracy in variable selection and prediction. Through simulations and two case studies, one on inflammatory bowel disease (IBD) and another on colorectal cancer (CRC), we demonstrate that SLR outperforms existing methods, particularly in high-dimensional settings. SLR is implemented in an R package, publicly available on GitHub.
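The two-stage idea described in the abstract (screen variables against the response, then group the survivors into a balance) can be illustrated with a minimal Python sketch. This is not the authors' R implementation. The function names (`clr`, `pearson`, `screen`, `split_two_groups`, `balance`, `simulate`), the correlation-based screening score, and the simulated data are all assumptions made for illustration. In particular, the greedy two-group split is a simplified stand-in for principal balance analysis, not that procedure itself.

```python
import math
import random
import statistics


def clr(row):
    """Centered log-ratio transform of one strictly positive composition."""
    logs = [math.log(v) for v in row]
    center = sum(logs) / len(logs)
    return [v - center for v in logs]


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def screen(X, y, k):
    """Indices of the k parts whose CLR values correlate most strongly with y.

    Assumption: absolute correlation is used as the univariate score.
    """
    Z = [clr(row) for row in X]
    p = len(X[0])
    score = [abs(pearson([z[j] for z in Z], y)) for j in range(p)]
    return sorted(sorted(range(p), key=lambda j: -score[j])[:k])


def logratio_var(X, i, j):
    """Sample variance of log(x_i / x_j), one entry of the variation matrix."""
    return statistics.variance([math.log(row[i] / row[j]) for row in X])


def split_two_groups(X, active):
    """Greedy two-way split (a stand-in for principal balance analysis).

    Seeds the groups with the pair of parts whose log ratio varies most, then
    attaches every other part to the seed with which its log ratio varies least.
    """
    pairs = [(i, j) for a, i in enumerate(active) for j in active[a + 1:]]
    s1, s2 = max(pairs, key=lambda ij: logratio_var(X, *ij))
    g1, g2 = [s1], [s2]
    for j in active:
        if j in (s1, s2):
            continue
        closer_to_s1 = logratio_var(X, j, s1) <= logratio_var(X, j, s2)
        (g1 if closer_to_s1 else g2).append(j)
    return sorted(g1), sorted(g2)


def mean_log(row, idx):
    """Mean of the log abundances of the parts listed in idx."""
    return sum(math.log(row[j]) for j in idx) / len(idx)


def balance(row, num, den):
    """Balance (scaled log ratio of geometric means) between two groups."""
    r, s = len(num), len(den)
    return math.sqrt(r * s / (r + s)) * (mean_log(row, num) - mean_log(row, den))


def simulate(n=200, p=20, seed=0):
    """Toy data: parts 0-2 rise and parts 3-5 fall with a latent signal u."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        u = rng.gauss(0, 1)
        logw = [rng.gauss(0, 0.3) for _ in range(p)]
        for j in (0, 1, 2):
            logw[j] += u
        for j in (3, 4, 5):
            logw[j] -= u
        w = [math.exp(v) for v in logw]
        total = sum(w)
        X.append([v / total for v in w])  # keep relative abundances only
        y.append(u + rng.gauss(0, 0.1))
    return X, y


X, y = simulate()
active = screen(X, y, k=6)
num, den = split_two_groups(X, active)
b = [balance(row, num, den) for row in X]
print("active parts:", active)
print("groups:", num, den)
print("abs. correlation of balance with y:", round(abs(pearson(b, y)), 3))
```

The sketch mirrors the abstract in that the screening step uses the response while the grouping step uses only the dependence among the retained predictors. In this toy setup, the number of retained variables `k` is fixed by hand; how the actual method chooses its screening threshold is not stated in the abstract.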