InstructPro: Natural Language Guided Ligand-Binding Protein Design

The de novo design of ligand-binding proteins with tailored functions is essential for advancing biotechnology and molecular medicine, yet existing AI approaches are limited by scarce protein-ligand complex data. To circumvent this data bottleneck, we leverage the abundant natural language descriptions characterizing protein-ligand interactions. Here, we introduce InstructPro, a family of generative models that design proteins following the guidance of natural language instructions and ligand formulas. InstructPro produces protein sequences consistent with specified function descriptions and ligand targets. To enable training and evaluation, we develop InstructProBench, a large-scale dataset of 9.6 million (function description, ligand, protein) triples. We train two model variants -- InstructPro-1B and InstructPro-3B -- that substantially outperform strong baselines. InstructPro-1B achieves an AlphaFold3 ipTM of 0.918 and a binding affinity of -8.764 on seen ligands, while maintaining robust performance in a zero-shot setting with scores of 0.869 and -6.713, respectively. These results are accompanied by novelty scores of 70.1% and 68.8%, underscoring the model's ability to generalize beyond the training set. Furthermore, the model yields a superior binding free energy of -20.9 kcal/mol and an average of 5.82 intermolecular hydrogen bonds, validating its proficiency in designing high-affinity ligand-binding proteins. Notably, scaling to InstructPro-3B further improves the zero-shot ipTM to 0.882, binding affinity to -6.797, and binding free energy to -25.8 kcal/mol, demonstrating clear performance gains associated with increased model capacity. These findings highlight the power of natural language-guided generative models to mitigate the data bottlenecks in traditional structure-based methods, significantly broadening the scope of de novo protein design.
