Machine learning (ML) models are increasingly deployed for virtual screening in drug discovery, where the goal is to identify novel, chemically diverse scaffolds while minimizing experimental costs. This creates a fundamental challenge: the most valuable discoveries lie in out-of-distribution (OOD) regions beyond the training data, yet ML models often degrade under distribution shift. Standard novelty-rejection strategies ensure reliability within the training domain but limit discovery by rejecting precisely the novel scaffolds most worth finding. Moreover, experimental budgets permit testing only a small fraction of nominated candidates, demanding models that produce reliable confidence estimates. We introduce EXPLOR (Extrapolatory Pseudo-Label Matching for OOD Uncertainty-Based Rejection), a framework that addresses both challenges through extrapolatory pseudo-labeling on latent-space augmentations, requiring only a single labeled training set and no access to unlabeled test compounds, mirroring the realistic conditions of prospective screening campaigns. Through a multi-headed architecture with a novel per-head matching loss, EXPLOR learns to extrapolate to OOD chemical space while producing reliable confidence estimates, with particularly strong performance in high-confidence regions, which is critical for virtual screening where only top-ranked candidates advance to experimental validation. We demonstrate state-of-the-art performance across chemical and tabular benchmarks using different molecular embeddings.
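The rejection setting described above, in which only the highest-confidence predictions advance to experimental testing, can be illustrated with a minimal sketch. This is not EXPLOR's method: the function name `select_top_confident`, the `budget_fraction` parameter, and the confidence score (distance of the mean head prediction from 0.5, penalized by disagreement across heads) are all hypothetical choices made for illustration. The abstract does not specify how confidence is computed.

```python
from statistics import mean, pstdev


def select_top_confident(head_probs, budget_fraction):
    """Keep the most confident predictions and reject the rest.

    head_probs: one list per compound, holding the predicted activity
        probability from each head of a multi-headed model.
    budget_fraction: share of compounds that may be retained (0 < f <= 1).
    Returns (compound_index, mean_probability, confidence) tuples,
    sorted from most to least confident.
    """
    if not 0.0 < budget_fraction <= 1.0:
        raise ValueError("budget_fraction must be in (0, 1]")

    scored = []
    for idx, probs in enumerate(head_probs):
        avg = mean(probs)
        # Hypothetical confidence score (an assumption, not from the paper):
        # a margin from the 0.5 decision boundary, minus the spread
        # of the per-head predictions.
        confidence = abs(avg - 0.5) * 2.0 - pstdev(probs)
        scored.append((confidence, idx, avg))

    # Highest confidence first; everything past the budget is rejected.
    scored.sort(key=lambda item: item[0], reverse=True)
    n_keep = max(1, int(len(scored) * budget_fraction))
    return [(idx, avg, conf) for conf, idx, avg in scored[:n_keep]]


# Toy example: four compounds, three heads each.
toy_head_probs = [
    [0.90, 0.92, 0.88],  # heads agree: likely active
    [0.50, 0.90, 0.10],  # heads disagree strongly
    [0.55, 0.50, 0.45],  # heads agree, but near the decision boundary
    [0.05, 0.04, 0.06],  # heads agree: likely inactive
]
kept = select_top_confident(toy_head_probs, budget_fraction=0.5)
```

With a budget of one half, the two compounds on which the heads agree and sit far from the decision boundary are retained, and the ambiguous or contested ones are rejected. In a screening campaign one would typically further restrict the retained set to predicted actives.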