Motivation: The size of available omics datasets has increased steadily in recent years as technology has advanced. While larger sample sizes can improve the performance of relevant prediction tasks in healthcare, models optimized for large datasets usually operate as black boxes. In high-stakes settings such as healthcare, using a black-box model poses safety and security issues. Without an explanation of the molecular factors and phenotypes that affected a prediction, healthcare providers have no choice but to trust the model blindly. We propose a new type of artificial neural network, the Convolutional Omics Kernel Network (COmic). By combining convolutional kernel networks with pathway-induced kernels, our method enables robust and interpretable end-to-end learning on omics datasets ranging in size from a few hundred to several hundred thousand samples. Furthermore, COmic can be easily adapted to utilize multi-omics data.

Results: We evaluated the performance of COmic on six different breast cancer cohorts. Additionally, we trained COmic models on multi-omics data using the METABRIC cohort. Our models performed better than or similarly to competitors on both tasks. We show how the use of pathway-induced Laplacian kernels opens up the black box of neural networks and results in intrinsically interpretable models that eliminate the need for post-hoc explanation models.