Designing novel proteins with desired characteristics remains a significant challenge because the sequence space is vast and sequence-function relationships are complex. Efficiently exploring this space to identify sequences that meet specific design criteria is crucial for advancing therapeutics and biotechnology. Here, we present BoGA (Bayesian Optimization Genetic Algorithm), a framework that combines evolutionary search with Bayesian optimization to navigate this space efficiently. By integrating a genetic algorithm as a stochastic proposal generator within a surrogate modeling loop, BoGA prioritizes candidates for evaluation using prior evaluations and surrogate model predictions, enabling data-efficient optimization. We demonstrate the utility of BoGA by benchmarking it on sequence and structure design tasks and then applying it to the design of peptide binders against pneumolysin, a key virulence factor of \textit{Streptococcus pneumoniae}. BoGA accelerates the discovery of high-confidence binders, demonstrating its potential for efficient protein design across diverse objectives. The algorithm is implemented within the BoPep suite and is available under an MIT license at \href{https://github.com/ErikHartman/bopep}{GitHub}.
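The loop described above, in which a genetic algorithm proposes candidates and a surrogate model decides which of them are worth evaluating, can be illustrated with a minimal sketch. This is not the BoPep implementation: the toy objective (fraction of positions matching a hidden target), the k-nearest-neighbour surrogate, the upper-confidence-bound acquisition, and every function name and parameter below are assumptions chosen only to keep the example self-contained.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def toy_objective(seq, target="ACDEFGHIKL"):
    # Hypothetical stand-in for an expensive evaluation (e.g. a structure
    # prediction run); returns the fraction of positions matching a target.
    return sum(x == y for x, y in zip(seq, target)) / len(target)


def mutate(seq, rng, rate=0.1):
    """Resample each position with probability `rate`."""
    return "".join(rng.choice(AMINO_ACIDS) if rng.random() < rate else c for c in seq)


def crossover(a, b, rng):
    """Single-point crossover of two equal-length parents."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def surrogate(seq, evaluated, k=3):
    # Assumed surrogate: mean score of the k nearest evaluated sequences,
    # with uncertainty taken as their mean normalized Hamming distance.
    nearest = sorted(evaluated.items(), key=lambda kv: hamming(seq, kv[0]))[:k]
    mean = sum(score for _, score in nearest) / len(nearest)
    uncertainty = sum(hamming(seq, s) for s, _ in nearest) / (len(nearest) * len(seq))
    return mean, uncertainty


def boga_sketch(objective, length=10, n_init=8, rounds=15,
                n_proposals=60, batch=4, kappa=0.5, seed=0):
    rng = random.Random(seed)

    # Initial design: random sequences, all evaluated with the true objective.
    evaluated = {}
    while len(evaluated) < n_init:
        seq = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
        evaluated[seq] = objective(seq)
    history = [max(evaluated.values())]

    for _ in range(rounds):
        # Genetic algorithm as proposal generator: recombine and mutate
        # the best sequences evaluated so far.
        parents = sorted(evaluated, key=evaluated.get, reverse=True)[:max(2, n_init // 2)]
        proposals, attempts = [], 0
        while len(proposals) < n_proposals and attempts < 10 * n_proposals:
            attempts += 1
            a, b = rng.sample(parents, 2)
            child = mutate(crossover(a, b, rng), rng)
            if child not in evaluated and child not in proposals:
                proposals.append(child)

        # Surrogate-guided prioritization: rank proposals by an
        # upper-confidence-bound score and evaluate only the top batch.
        def ucb(seq):
            mean, uncertainty = surrogate(seq, evaluated)
            return mean + kappa * uncertainty

        for seq in sorted(proposals, key=ucb, reverse=True)[:batch]:
            evaluated[seq] = objective(seq)
        history.append(max(evaluated.values()))

    return evaluated, history


if __name__ == "__main__":
    evaluated, history = boga_sketch(toy_objective)
    best = max(evaluated, key=evaluated.get)
    print(f"evaluations: {len(evaluated)}, best: {best} ({evaluated[best]:.2f})")
```

Only the top-ranked batch in each round reaches the objective, which is where the data efficiency comes from: the genetic operators are cheap and can generate many proposals, while the expensive evaluation is reserved for candidates the surrogate considers promising or uncertain. In a realistic setting the nearest-neighbour surrogate would be replaced by a learned model with calibrated uncertainty.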