We present TerraBind, a foundation model for protein-ligand structure and binding affinity prediction that achieves 26-fold faster inference than state-of-the-art methods while improving affinity prediction accuracy by $\sim$20\%. Current deep learning approaches to structure-based drug design rely on expensive all-atom diffusion to generate 3D coordinates, creating inference bottlenecks that make large-scale compound screening computationally intractable. We challenge this paradigm with a central hypothesis: full all-atom resolution is unnecessary for accurate small-molecule pose and binding affinity prediction. TerraBind tests this hypothesis with a coarse pocket-level representation (protein C$_\beta$ atoms and ligand heavy atoms only) inside a multimodal architecture that combines COATI-3 molecular encodings with ESM-2 protein embeddings. The structural representations it learns feed two modules: a diffusion-free optimization module for pose generation and a binding affinity likelihood prediction module. On structure prediction benchmarks (FoldBench, PoseBusters, Runs N' Poses), TerraBind matches diffusion-based baselines in ligand pose accuracy. On binding affinity prediction, TerraBind outperforms Boltz-2 by $\sim$20\% in Pearson correlation on both a public benchmark (CASP16) and a diverse proprietary dataset (18 biochemical/cell assays). We show that the affinity prediction module also provides well-calibrated uncertainty estimates, addressing a critical gap in reliable compound prioritization for drug discovery. Furthermore, this module enables a continual learning framework and a hedged batch selection strategy that, in simulated drug discovery cycles, achieves 6$\times$ greater affinity improvement of the selected molecules than greedy approaches.