Investigating Knowledge Distillation Through Neural Networks for Protein Binding Affinity Prediction

Accurate prediction of protein--protein binding affinity is hampered by a trade-off between predictive accuracy and data availability. Structure-based machine learning models generally outperform sequence-based methods, but the scarcity of experimentally resolved protein structures limits their performance. To overcome this constraint, we propose a regression framework based on knowledge distillation that uses protein structural data during training and requires only sequence data at inference. In the proposed method, a structure-informed teacher network guides the training of a sequence-based student network, which is jointly supervised by binding affinity labels and intermediate feature representations. The framework was evaluated with Leave-One-Complex-Out (LOCO) cross-validation on a non-redundant protein--protein binding affinity benchmark dataset. Sequence-only baseline models achieved a maximum Pearson correlation coefficient (P_r) of 0.375 and a root mean squared error (RMSE) of 2.712 kcal/mol, whereas structure-based models achieved a P_r of 0.512 and an RMSE of 2.445 kcal/mol. The distillation-based student model reached a P_r of 0.481 and an RMSE of 2.488 kcal/mol, a substantial improvement over sequence-only performance. Detailed error analyses further confirmed improved agreement and reduced bias. These findings show that knowledge distillation is an effective way to transfer structural knowledge to sequence-based predictors, with the potential to close the performance gap between sequence-based and structure-based models as larger datasets become available. The source code for running inference with the proposed distillation-based binding affinity predictor is available at https://github.com/wajidarshad/ProteinAffinityKD.
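The joint supervision described above can be illustrated with a minimal sketch of a per-sample distillation objective. This is not the authors' implementation: the function names, the use of squared error for both terms, and the `feature_weight` parameter are assumptions made for illustration, and the abstract does not specify how the two terms are combined.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    student_pred: float,
    student_features: Sequence[float],
    teacher_features: Sequence[float],
    affinity_label: float,
    feature_weight: float = 0.5,  # hypothetical weighting, not taken from the paper
) -> float:
    """Joint loss for one complex: label term plus feature-matching term."""
    # Label term: squared error between the student's predicted affinity
    # and the experimental binding affinity (kcal/mol).
    label_term = (student_pred - affinity_label) ** 2
    # Feature term: pull the student's intermediate representation toward
    # the structure-informed teacher's representation (assumed same size).
    feature_term = mse(student_features, teacher_features)
    return label_term + feature_weight * feature_term


# Example with made-up values: prediction off by 1 kcal/mol, features differ.
loss = distillation_loss(-9.0, [1.0, 3.0], [0.0, 1.0], -10.0)
print(loss)
```

In a gradient-based framework the teacher features would typically be treated as fixed targets, so that only the student is updated; at inference the teacher and the structural inputs are not needed.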
