We develop scalable randomized kernel methods that jointly associate data from multiple sources and simultaneously predict an outcome or classify a unit into one of two or more classes. The proposed methods model nonlinear relationships in multiview data while predicting a clinical outcome, and they can identify the variables or groups of variables that contribute most to the relationships among the views. Because random Fourier bases can approximate shift-invariant kernel functions, we use them to construct nonlinear mappings of each view, and we use these mappings together with the outcome variable to learn view-independent low-dimensional representations. In simulation studies, the proposed methods outperform several other linear and nonlinear methods for multiview data integration. Applied to gene expression, metabolomics, proteomics, and lipidomics data pertaining to COVID-19, the proposed methods identified several molecular signatures for COVID-19 status and severity. Results from our real data application and from simulations with small sample sizes suggest that the proposed methods may be useful for small sample size problems. Availability: Our algorithms are implemented in PyTorch and interfaced in R, and will be made available at: https://github.com/lasandrall/RandMVLearn.
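The random Fourier approximation mentioned above can be illustrated with a minimal sketch. It assumes a Gaussian kernel with bandwidth `sigma`, which is one common shift-invariant kernel; the abstract does not specify which kernel is used. The function names (`gaussian_kernel`, `make_rff_map`, `dot`) are hypothetical and are not part of the RandMVLearn interface. The sketch uses only the Python standard library so that it stays self-contained; it shows the feature map for a single view and does not cover the joint learning of low-dimensional representations or variable selection.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Exact Gaussian (RBF) kernel, an example of a shift-invariant kernel."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def make_rff_map(dim, n_features, sigma=1.0, seed=0):
    """Return a map z such that dot(z(x), z(y)) approximates gaussian_kernel(x, y)."""
    rng = random.Random(seed)
    # Frequencies sampled from the kernel's spectral density, N(0, 1/sigma^2).
    weights = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
               for _ in range(n_features)]
    # Random phases sampled uniformly from [0, 2*pi).
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def feature_map(x):
        # One cosine feature per sampled frequency/phase pair.
        return [scale * math.cos(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)
                for w, b in zip(weights, phases)]

    return feature_map


def dot(u, v):
    """Plain inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


# Two arbitrary illustrative points (assumed values, not data from the paper).
x = [0.5, -1.0, 0.2]
y = [0.1, -0.7, 0.9]

z = make_rff_map(dim=3, n_features=5000, sigma=1.0, seed=0)
exact = gaussian_kernel(x, y)
approx = dot(z(x), z(y))
print(f"exact kernel: {exact:.4f}, random Fourier estimate: {approx:.4f}")
```

The inner product of the random features is a Monte Carlo estimate of the kernel value, so its error shrinks as `n_features` grows. This is what makes such mappings attractive for scalability: a linear method applied to `z(x)` stands in for a kernel method without forming the full kernel matrix.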