Protein structure-based property prediction has emerged as a promising approach for various biological tasks, such as protein function prediction and sub-cellular location estimation. Existing methods rely heavily on experimental protein structure data and fail when such data are unavailable. Structures predicted by AI tools (e.g., AlphaFold2) have been used as alternatives. However, we observe that the current practice of simply substituting predicted structures at inference time, even accurately predicted ones, leads to a notable drop in prediction accuracy. While similar phenomena have been studied extensively in general fields (e.g., Computer Vision) under the name of model robustness, their impact on protein property prediction remains unexplored. In this paper, we first investigate why performance decreases when predicted structures are used, and attribute the decrease to structure embedding bias from the perspective of structure representation learning. To study this problem, we formulate a Protein 3D Graph Structure Learning Problem for Robust Protein Property Prediction (PGSL-RP3), collect benchmark datasets, and present a protein Structure embedding Alignment Optimization framework (SAO) that mitigates the structure embedding bias between predicted and experimental protein structures. Extensive experiments show that our framework is model-agnostic and improves property prediction on both predicted and experimental structures. The benchmark datasets and code will be released to benefit the community.
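To make the idea of aligning structure embeddings concrete, the following is a minimal sketch and not the SAO framework itself, whose details are not given in this abstract. It assumes that each protein has one embedding computed from its predicted structure and one from its experimental structure, and that alignment is encouraged by an auxiliary mean-squared-distance term added to the task loss. The function names `embedding_alignment_loss` and `total_loss`, the choice of squared Euclidean distance, and the default `weight` are all hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]


def embedding_alignment_loss(
    predicted: Sequence[Vector], experimental: Sequence[Vector]
) -> float:
    """Mean squared Euclidean distance between paired structure embeddings.

    Assumption: predicted[i] and experimental[i] are embeddings of the same
    protein, computed from its predicted and experimental structure.
    """
    if len(predicted) != len(experimental):
        raise ValueError("embeddings must be paired one-to-one")
    if not predicted:
        raise ValueError("need at least one embedding pair")

    total = 0.0
    for p, e in zip(predicted, experimental):
        if len(p) != len(e):
            raise ValueError("paired embeddings must have the same dimension")
        # Squared L2 distance for this pair.
        total += sum((pi - ei) ** 2 for pi, ei in zip(p, e))
    return total / len(predicted)


def total_loss(
    task_loss: float,
    predicted: Sequence[Vector],
    experimental: Sequence[Vector],
    weight: float = 0.1,  # hypothetical trade-off coefficient
) -> float:
    """Task loss plus a weighted alignment penalty (illustrative only)."""
    return task_loss + weight * embedding_alignment_loss(predicted, experimental)
```

In this sketch the alignment term is zero when the two embeddings of every protein coincide and grows as they drift apart, so minimizing it alongside the task loss is one simple way to reduce the kind of embedding bias described above.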