Biomedical research now commonly integrates diverse data types, or views, from the same individuals to better understand the pathobiology of complex diseases, but integrating these views meaningfully remains a challenge. Existing methods often require the same type of data from all views (cross-sectional only or longitudinal only) or do not consider a class outcome during integration. To overcome these limitations, we developed a pipeline that combines statistical and deep learning methods to integrate cross-sectional and longitudinal data from multiple sources. The pipeline also identifies key variables that contribute to the association between views and to the separation among classes, providing deeper biological insight. It comprises variable selection/ranking using linear and nonlinear methods, feature extraction using functional principal component analysis and Euler characteristics, and joint integration and classification using dense feed-forward networks and recurrent neural networks. We applied the pipeline to cross-sectional and longitudinal multi-omics data (metagenomics, transcriptomics, and metabolomics) from an inflammatory bowel disease (IBD) study and identified microbial pathways, metabolites, and genes that discriminate by IBD status, providing information on the etiology of IBD. We also conducted simulations to compare the two feature extraction methods. The pipeline is available from the following GitHub repository: https://github.com/lasandrall/DeepIDA-GRU.
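To make the feature extraction step more concrete, the following is a minimal sketch of how functional principal component analysis can reduce each subject's longitudinal trajectory to a few scores that a downstream classifier could consume. It is an illustration under stated assumptions, not the code in the DeepIDA-GRU repository: the function name `fpca_scores`, the synthetic trajectories, and the assumption that every subject is observed on the same equally spaced time grid are all hypothetical. Under that assumption, the principal component functions can be approximated by the right singular vectors of the centered data matrix.

```python
import numpy as np


def fpca_scores(curves, n_components=2):
    """Approximate FPCA for curves sampled on a common, equally spaced grid.

    curves: array of shape (n_subjects, n_timepoints).
    Returns (scores, components, explained_ratio).
    Hypothetical helper for illustration only.
    """
    curves = np.asarray(curves, dtype=float)
    # Remove the mean trajectory so components describe variation around it.
    centered = curves - curves.mean(axis=0)
    # Rows of vt are orthonormal discretized component functions.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    k = min(n_components, len(s))
    # Scores are the projections of each centered curve onto the components.
    scores = u[:, :k] * s[:k]
    explained_ratio = (s ** 2 / np.sum(s ** 2))[:k]
    return scores, vt[:k], explained_ratio


# Synthetic example (assumed data): 30 subjects, 20 time points,
# trajectories built from two smooth modes plus a little noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 20)
a = rng.normal(size=(30, 1))
b = rng.normal(scale=0.3, size=(30, 1))
noise = rng.normal(scale=0.01, size=(30, 20))
curves = a * np.sin(2 * np.pi * t) + b * np.cos(2 * np.pi * t) + noise

scores, components, explained_ratio = fpca_scores(curves, n_components=2)
print(scores.shape, components.shape)
```

Each row of `scores` is a fixed-length summary of one subject's trajectory, so longitudinal and cross-sectional views end up in a comparable tabular form. Irregularly sampled or sparse trajectories would need smoothing or a dedicated FPCA implementation, which this sketch does not cover.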