In this paper, we propose a data-driven method for learning interpretable topological features of biomolecular data, and we demonstrate that parsimonious models trained on these features are effective at predicting the stability of synthetic mini proteins. We compare models that use automatically learned structural features against models trained on a large set of biophysical features determined by subject-matter experts (SMEs). Our models, based only on topological features of the protein structures, achieved 92% to 99% of the performance of SME-based models as measured by the average precision score. By interrogating model performance and feature importance metrics, we find high correlations between topological features and SME features. We further show that combining topological features and SME features can improve model performance over either feature set used in isolation, suggesting that, in some settings, topological features may provide new discriminating information that is not captured by existing SME features and is useful for protein stability prediction.