Protein design, a grand challenge of the day, involves optimization on a fitness landscape. Leading methods adopt a model-based approach in which a model is trained on a training set (protein sequences and their fitness) and proposes candidates to explore next. These methods are challenged by the sparsity of high-fitness samples in the training set, a problem that has been recognized in the literature. A less recognized but equally important problem stems from the distribution of training samples in the design space: leading methods are not designed for scenarios where the desired optimum lies in a region that is not only poorly represented in the training data but also relatively far from the highly represented low-fitness regions. We show that this problem of "separation" in the design space is a significant bottleneck for existing model-based optimization tools, and we propose a new approach that uses a novel VAE as its search model to overcome it. We demonstrate its advantage over prior methods in robustly finding improved samples, regardless of the imbalance and separation between low- and high-fitness samples. Our comprehensive benchmark on real and semi-synthetic protein datasets, as well as on solution design for physics-informed neural networks, showcases the generality of our approach in discrete and continuous design spaces. Our implementation is available at https://github.com/sabagh1994/PGVAE.