Compositional data, in which only the relative abundances of variables are measured, are ubiquitous. In health and medical applications of compositional data, log ratios between groups of variables form an important class of biomarkers. However, selecting log ratios that are predictive of a response variable is a combinatorial problem. Existing methods based on greedy search are time-consuming, which hinders their application to high-dimensional data sets. We propose a novel selection approach, the supervised log ratio method, that efficiently selects predictive log ratios in high-dimensional settings. The method is motivated by a latent variable model, and we show that the log ratio biomarker can be selected via simple clustering after supervised feature screening. The supervised log ratio method is implemented in an R package, which is publicly available at \url{https://github.com/drjingma/slr}. We illustrate the merits of our approach through simulation studies and an analysis of a microbiome data set on HIV infection.
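The screen-then-cluster idea described in the abstract can be illustrated with a minimal Python sketch. It is not the implementation in the R package and makes several assumptions that the abstract does not state: screening ranks centered-log-ratio features by absolute correlation with the response, the "simple clustering" is a spectral bipartition of the pairwise log-ratio variance matrix, and the biomarker is the log ratio of geometric means of the two groups. The function names `screen_features`, `split_two_groups`, and `log_ratio` are hypothetical.

```python
import numpy as np


def clr(X):
    """Centered log-ratio transform of a strictly positive (n, p) matrix."""
    logX = np.log(X)
    return logX - logX.mean(axis=1, keepdims=True)


def screen_features(X, y, k):
    """Assumed screening rule: keep the k features whose CLR values
    have the largest absolute correlation with the response y."""
    Z = clr(X)
    Zc = Z - Z.mean(axis=0)
    yc = y - y.mean()
    corr = (Zc.T @ yc) / (np.linalg.norm(Zc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(corr))[:k]


def split_two_groups(X, selected):
    """Assumed clustering rule: spectral bipartition of the selected features.

    Features whose pairwise log ratio has small variance are treated as
    similar; the sign of the Fiedler vector of the graph Laplacian splits
    them into two groups. Which group comes first is arbitrary.
    """
    logX = np.log(X[:, selected])
    diffs = logX[:, :, None] - logX[:, None, :]  # shape (n, k, k)
    A = diffs.var(axis=0)                        # variance of log(x_j / x_l)
    S = A.max() - A                              # turn dissimilarity into similarity
    np.fill_diagonal(S, 0.0)
    L = np.diag(S.sum(axis=1)) - S               # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)                  # eigenvalues in ascending order
    fiedler = vecs[:, 1]
    return selected[fiedler >= 0], selected[fiedler < 0]


def log_ratio(X, group1, group2):
    """Log ratio of the geometric means of two groups of variables."""
    logX = np.log(X)
    return logX[:, group1].mean(axis=1) - logX[:, group2].mean(axis=1)
```

Given a positive abundance matrix `X` and a response `y`, one would call `screen_features(X, y, k)`, pass the result to `split_two_groups`, and use `log_ratio` on the two groups as a single candidate biomarker. The number of screened features `k` is a tuning parameter here; how it is chosen in the actual method is not specified in the abstract.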