Designing ligands that are both chemically valid and structurally compatible with protein binding pockets is a key bottleneck in computational drug discovery. Existing approaches either ignore structural context or rely on expensive, memory-intensive encoding that limits throughput and scalability. We present SiDGen (Structure-informed Diffusion Generator), a protein-conditioned diffusion framework that integrates masked SMILES generation with lightweight folding-derived features for pocket awareness. To balance expressivity with efficiency, SiDGen supports two conditioning pathways: a streamlined mode that pools coarse structural signals from protein embeddings and a full mode that injects localized pairwise biases for stronger coupling. A coarse-stride folding mechanism with nearest-neighbor upsampling alleviates the quadratic memory costs of pair tensors, enabling training on realistic sequence lengths. Learning stability is maintained through in-loop chemical validity checks and an invalidity penalty, while large-scale training efficiency is restored \textit{via} selective compilation, dataloader tuning, and gradient accumulation. In automated benchmarks, SiDGen generates ligands with high validity, uniqueness, and novelty, while achieving competitive performance in docking-based evaluations and maintaining reasonable molecular properties. These results demonstrate that SiDGen can deliver scalable, pocket-aware molecular design, providing a practical route to conditional generation for high-throughput drug discovery.
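The memory argument behind coarse-stride folding can be illustrated with a toy example. The sketch below is not SiDGen's implementation: the function names `coarse_pair_features` and `upsample_nearest`, the 1-D coordinates, and the absolute-difference "pair feature" are all assumptions made for illustration. It only shows the general idea of computing a pair tensor on every `stride`-th residue and then expanding it back to full resolution with nearest-neighbor indexing, so that the stored coarse tensor is roughly `stride**2` times smaller than the full one.

```python
def coarse_pair_features(coords, stride):
    """Compute a pair matrix only between every `stride`-th residue.

    `coords` is a toy list of 1-D positions (hypothetical stand-in for
    residue features); the pair feature here is a plain absolute difference.
    """
    anchors = coords[::stride]
    return [[abs(a - b) for b in anchors] for a in anchors]


def upsample_nearest(coarse, length, stride):
    """Nearest-neighbor upsampling of a coarse pair matrix.

    Full-resolution cell (i, j) reads coarse cell (i // stride, j // stride),
    so each coarse value is repeated over a stride x stride block.
    """
    return [
        [coarse[i // stride][j // stride] for j in range(length)]
        for i in range(length)
    ]


# Toy input: 8 "residues" on a line, folded at stride 4 (assumed values).
coords = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
stride = 4

coarse = coarse_pair_features(coords, stride)        # 2 x 2 matrix
full = upsample_nearest(coarse, len(coords), stride)  # 8 x 8 matrix

coarse_cells = len(coarse) * len(coarse[0])
full_cells = len(full) * len(full[0])
print(f"coarse cells: {coarse_cells}, full cells: {full_cells}")
```

In this toy setting the coarse matrix holds 4 cells against 64 at full resolution. In a real model the upsampled values would typically be consumed as attention biases rather than materialized as nested lists, which is a further assumption beyond what the abstract states.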