Following recent groundbreaking advances in protein structure prediction, one of the remaining challenges in protein machine learning is to reliably predict distributions of structural states. Parametric models of fluctuations are difficult to fit because of the complex covariance structure between degrees of freedom in the protein chain, which often causes models to violate either local or global structural constraints. In this paper, we present a new strategy for modelling protein densities in internal coordinates, which uses constraints in 3D space to induce covariance structure between the internal degrees of freedom. We illustrate the potential of the procedure by constructing a variational autoencoder whose full-covariance output is induced by the constraints implied by the conditional mean in 3D. We demonstrate that our approach makes it possible to scale density models of internal coordinates to full protein backbones in two settings: 1) a unimodal setting for proteins exhibiting small fluctuations and limited available data, and 2) a multimodal setting for larger conformational changes in a high-data regime.
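To make the central idea concrete, the following is a minimal toy sketch of how a constraint on 3D (here 2D) atom fluctuations can induce off-diagonal covariance between internal coordinates. It is an illustration under stated assumptions, not the model described in the paper: the chain is planar with fixed bond lengths, the internal coordinates are turning angles, the atom displacements are linearised through a Jacobian, and the function names, the prior width `prior_sigma` and the constraint weight `lam` are all hypothetical. Under these assumptions the precision matrix over angles is the diagonal prior precision plus `lam` times the Gram matrix of the Jacobian.

```python
import math


def chain_jacobian(thetas, bond_length=1.0):
    """Jacobian of atom positions w.r.t. turning angles for a planar chain.

    Returns 2n rows (x then y for each atom), each of length n.
    """
    n = len(thetas)
    # The absolute direction of bond j is the cumulative sum of turning angles.
    phis, total = [], 0.0
    for t in thetas:
        total += t
        phis.append(total)
    rows = []
    for k in range(n):
        dx, dy = [0.0] * n, [0.0] * n
        for i in range(k + 1):
            # Changing theta_i rotates every bond j with i <= j <= k.
            dx[i] = sum(-bond_length * math.sin(phis[j]) for j in range(i, k + 1))
            dy[i] = sum(bond_length * math.cos(phis[j]) for j in range(i, k + 1))
        rows.extend([dx, dy])
    return rows


def induced_precision(jac, prior_sigma, lam):
    """Diagonal prior precision plus lam * J^T J (toy assumption)."""
    n = len(jac[0])
    prec = [[0.0] * n for _ in range(n)]
    for i in range(n):
        prec[i][i] = 1.0 / prior_sigma ** 2
    for row in jac:
        for a in range(n):
            for b in range(n):
                prec[a][b] += lam * row[a] * row[b]
    return prec


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (small matrices only)."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def positional_variance(jac, cov, atom):
    """First-order total variance of one atom: trace(J_k cov J_k^T)."""
    total = 0.0
    for row in (jac[2 * atom], jac[2 * atom + 1]):
        n = len(row)
        total += sum(row[a] * cov[a][b] * row[b]
                     for a in range(n) for b in range(n))
    return total


# Hypothetical mean angles, prior width, and constraint weight.
thetas = [0.3, -0.5, 0.4, -0.2, 0.6]
jac = chain_jacobian(thetas)
cov_free = invert(induced_precision(jac, prior_sigma=0.2, lam=0.0))
cov_constrained = invert(induced_precision(jac, prior_sigma=0.2, lam=50.0))

last = len(thetas) - 1
var_free = positional_variance(jac, cov_free, last)
var_constrained = positional_variance(jac, cov_constrained, last)
print("end-atom variance, independent angles:", var_free)
print("end-atom variance, constrained angles:", var_constrained)
```

With `lam=0.0` the angles are independent and small per-angle errors accumulate along the chain. With `lam > 0` the covariance over angles acquires off-diagonal terms, and the first-order positional variance of the last atom shrinks, which is the qualitative effect the abstract attributes to constraint-induced covariance.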