The biological functions of proteins often depend on dynamic structural ensembles. In this work, we develop a flow-based generative modeling approach for learning and sampling the conformational landscapes of proteins. We repurpose highly accurate single-state predictors such as AlphaFold and ESMFold and fine-tune them under a custom flow matching framework to obtain sequence-conditioned generative models of protein structure called AlphaFlow and ESMFlow. When trained and evaluated on the PDB, our method provides a superior combination of precision and diversity compared to AlphaFold with MSA subsampling. When further trained on ensembles from all-atom MD, our method accurately captures conformational flexibility, positional distributions, and higher-order ensemble observables for unseen proteins. Moreover, our method can diversify a static PDB structure with faster wall-clock convergence to certain equilibrium properties than replicate MD trajectories, demonstrating its potential as a proxy for expensive physics-based simulations. Code is available at https://github.com/bjing2016/alphaflow.