Gaussian processes (GPs) are Bayesian models that offer several advantages for regression tasks in machine learning, such as reliable quantification of uncertainty and improved interpretability. Their adoption has been hindered by their excessive computational cost and by the difficulty of adapting them to sequences (e.g., amino acid and nucleotide sequences) and graphs (e.g., those representing small molecules). In this study, we develop efficient and scalable approaches for fitting GP models, as well as fast convolution kernels that scale linearly with graph or sequence size. We implement these improvements in an open-source Python library called xGPR. We compare the performance of xGPR with the reported performance of various deep learning models on 20 benchmarks, including small molecule, protein sequence, and tabular data, and show that xGPR achieves highly competitive performance with much shorter training time. We also develop new kernels for sequence and graph data and show that xGPR generally outperforms convolutional neural networks at predicting key properties of proteins and small molecules. Importantly, xGPR provides uncertainty information not available from typical deep learning models, as well as a representation of the input data that can be used for clustering and data visualization. These results demonstrate that xGPR is a powerful and generic tool that can be broadly useful in protein engineering and drug discovery.
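To make the two properties mentioned above concrete (uncertainty estimates and high computational cost), the following is a minimal sketch of textbook exact GP regression in one dimension. It is not the xGPR API and does not reproduce the scalable fitting methods or convolution kernels developed in this work. The squared-exponential kernel, the function names (`rbf_kernel`, `gp_predict`), and the `noise` and `lengthscale` values are illustrative assumptions. The Cholesky factorization costs O(n³) in the number of training points, which is the kind of cost that limits exact GPs on large datasets. The code uses only the standard library to stay self-contained; a practical implementation would use a numerical library.

```python
import math

def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel between two scalars (illustrative choice)."""
    return math.exp(-0.5 * ((a - b) / lengthscale) ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def solve_lower(L, b):
    """Solve L x = b by forward substitution."""
    x = []
    for i in range(len(b)):
        x.append((b[i] - sum(L[i][k] * x[k] for k in range(i))) / L[i][i])
    return x

def solve_lower_transposed(L, b):
    """Solve L^T x = b by back substitution."""
    n = len(b)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(L[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (b[i] - s) / L[i][i]
    return x

def gp_predict(x_train, y_train, x_new, noise=1e-2, lengthscale=1.0):
    """Posterior mean and variance of the latent function at a single point x_new."""
    n = len(x_train)
    # Kernel matrix with observation noise on the diagonal.
    K = [[rbf_kernel(x_train[i], x_train[j], lengthscale) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)  # O(n^3) step
    alpha = solve_lower_transposed(L, solve_lower(L, y_train))  # (K + noise*I)^{-1} y
    k_star = [rbf_kernel(x_new, xi, lengthscale) for xi in x_train]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = rbf_kernel(x_new, x_new, lengthscale) - sum(vi * vi for vi in v)
    return mean, max(var, 0.0)  # clamp tiny negative values from round-off

# Toy data: three noiseless samples of sin(x).
x_train = [0.0, 1.0, 2.0]
y_train = [math.sin(x) for x in x_train]

mean_near, var_near = gp_predict(x_train, y_train, 1.0)   # at a training point
mean_far, var_far = gp_predict(x_train, y_train, 10.0)    # far from all data
```

In this toy setting the predictive variance is small at a training input and returns to roughly the prior variance far from the data, which is the behavior that makes GP uncertainty estimates useful for flagging predictions outside the training distribution.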