Protein fitness optimization is challenged by a vast combinatorial landscape where high-fitness variants are extremely sparse. Many current methods either underperform or require computationally expensive gradient-based sampling. We present CHASE, a framework that repurposes the evolutionary knowledge of pretrained protein language models by compressing their embeddings into a compact latent space. By training a conditional flow-matching model with classifier-free guidance, we enable the direct generation of high-fitness variants without predictor-based guidance during the ODE sampling steps. CHASE achieves state-of-the-art performance on AAV and GFP protein design benchmarks. Finally, we show that bootstrapping with synthetic data can further enhance performance in data-constrained settings.
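The guided generation step described above can be illustrated with a minimal sketch. This is not the CHASE implementation; it only shows the standard classifier-free-guidance rule applied to flow-matching ODE sampling, where the guided velocity extrapolates from the unconditional model toward the conditional one. All function and variable names (`sample_with_cfg`, `v_cond`, `v_uncond`, the guidance weight `w`) are hypothetical.

```python
import numpy as np

def sample_with_cfg(v_cond, v_uncond, z0, steps=50, w=2.0):
    """Integrate a flow-matching ODE with classifier-free guidance.

    v_cond / v_uncond: callables (z, t) -> velocity for the conditional
    and unconditional velocity fields (hypothetical stand-ins for the
    learned model evaluated with and without the fitness condition).
    w: guidance weight; w = 0 recovers plain conditional sampling.
    """
    z = z0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        # CFG velocity: extrapolate beyond the unconditional field
        v = (1.0 + w) * v_cond(z, t) - w * v_uncond(z, t)
        z = z + dt * v  # explicit Euler step along the ODE
    return z

# Toy fields: the conditional field pulls latents toward 1.0,
# the unconditional field toward 0.0; guidance pushes samples
# further toward the conditioned target than w = 0 would.
v_cond = lambda z, t: 1.0 - z
v_uncond = lambda z, t: 0.0 - z
guided = sample_with_cfg(v_cond, v_uncond, np.zeros(3), w=2.0)
plain = sample_with_cfg(v_cond, v_uncond, np.zeros(3), w=0.0)
```

Note that no fitness predictor is queried inside the sampling loop; the condition is baked into the learned velocity field, which is the property the abstract highlights.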