Despite the prevalence and many successes of deep learning in de novo molecular design, generating peptides that target specific proteins remains an unsolved problem. A main barrier is the scarcity of high-quality training data. To address this, we propose a novel machine-learning-based peptide design architecture called Latent Space Approximate Trajectory Collector (LSATC). It consists of a series of samplers along an optimization trajectory on a highly non-convex energy landscape, which together approximate the distribution of peptides with desired properties in a latent space. The process requires little human intervention and can be implemented end to end. We demonstrate the model by designing peptide extensions targeting Beta-catenin, a key nuclear effector protein in canonical Wnt signalling. Compared with a random sampler, LSATC samples peptides with $36\%$ lower binding scores within a $16$ times smaller interquartile range (IQR), and $284\%$ lower hydrophobicity within a $1.4$ times smaller IQR. LSATC also substantially outperforms other common generative models. Finally, we used a clustering algorithm to select 4 of the 100 LSATC-designed peptides for experimental validation. The results confirm that all four LSATC-extended peptides show improved Beta-catenin binding by at least $20.0\%$, and two of them show a $3$-fold increase in binding affinity compared with the base peptide.
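The sampler comparison reported above (a relative shift in scores together with a ratio of interquartile ranges) can be illustrated with a minimal sketch. It assumes the relative change is taken between medians, which the abstract does not state. The helper names `iqr` and `compare_samplers` are hypothetical, and the score lists are made-up placeholders rather than data from this work.

```python
from statistics import median, quantiles


def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


def compare_samplers(baseline, candidate):
    """Summarise how a candidate sampler's scores differ from a baseline's.

    Assumption: the relative change is measured between medians.
    """
    base_med = median(baseline)
    cand_med = median(candidate)
    return {
        # Negative values mean the candidate's median score is lower.
        "median_change_pct": 100.0 * (cand_med - base_med) / abs(base_med),
        # Values above 1 mean the candidate's scores are more tightly spread.
        "iqr_ratio": iqr(baseline) / iqr(candidate),
    }


if __name__ == "__main__":
    # Placeholder scores for illustration only (not results from the paper).
    random_scores = [-10.0, -14.0, -6.0, -18.0, -2.0, -12.0, -8.0, -16.0]
    model_scores = [-13.0, -13.5, -14.0, -12.5, -13.2, -13.8, -12.8, -14.2]
    print(compare_samplers(random_scores, model_scores))
```

Reporting a spread ratio next to the relative shift separates two effects: a sampler can lower the typical score without being consistent, or be consistent without lowering the score.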