Scaffold Splits Overestimate Virtual Screening Performance

Virtual Screening (VS) of vast compound libraries guided by Artificial Intelligence (AI) models is a highly productive approach to early drug discovery. How the data are split is crucial for benchmarking such AI models realistically. Traditional random splits place similar molecules in both the training and test sets, which conflicts with the reality of VS libraries, whose compounds are mostly structurally distinct from the training data. The scaffold split, which groups molecules by shared core structure, is widely considered to reflect this real-world scenario. Here we show, however, that the scaffold split also overestimates VS performance. The reason is that molecules with different chemical scaffolds are often similar to each other, so a scaffold split still leaves unrealistically high similarities between training and test molecules. Our study examined three representative AI models on 60 NCI-60 datasets, each comprising approximately 30,000 to 50,000 molecules tested on a different cancer cell line. Each dataset was split with three methods: scaffold, Butina clustering, and the more accurate Uniform Manifold Approximation and Projection (UMAP) clustering. Across the 2100 models trained and evaluated for each algorithm and split, performance is much worse with UMAP splits, regardless of the model. These robust results demonstrate the need for more realistic data splits to tune, compare, and select models for VS. For the same reason, we also recommend avoiding the scaffold split in other molecular property prediction problems. The code to reproduce these results is available at https://github.com/ScaffoldSplitsOverestimateVS
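The central argument, that a test molecule can sit in a different group from every training molecule and still be very similar to some of them, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper's repository: the function names, the toy fingerprints (written as sets of on-bit indices), and the hand-assigned scaffold and cluster labels. It uses only the Python standard library; a real pipeline would derive fingerprints, scaffolds, and clusters with a cheminformatics toolkit.

```python
from collections import defaultdict


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_split(groups, test_fraction=0.2):
    """Split molecule indices so that no group spans both train and test.

    groups: one label per molecule (a scaffold ID or a cluster ID).
    Larger groups are placed in train first; groups that no longer fit go to test.
    """
    members = defaultdict(list)
    for idx, label in enumerate(groups):
        members[label].append(idx)
    ordered = sorted(members.values(), key=len, reverse=True)
    max_train = len(groups) - int(round(test_fraction * len(groups)))
    train, test = [], []
    for grp in ordered:
        if len(train) + len(grp) <= max_train:
            train.extend(grp)
        else:
            test.extend(grp)
    return sorted(train), sorted(test)


def nearest_train_similarity(fps, train, test):
    """For each test molecule, the highest similarity to any training molecule."""
    return [max(tanimoto(fps[t], fps[i]) for i in train) for t in test]


# Hypothetical toy data: molecule 2 has its own scaffold "B" but shares
# three of its four on-bits with the scaffold "A" molecules.
fps = [
    {1, 2, 3, 4},      # molecule 0, scaffold A
    {1, 2, 3, 5},      # molecule 1, scaffold A
    {1, 2, 3, 6},      # molecule 2, scaffold B
    {10, 11, 12, 13},  # molecule 3, scaffold C
    {10, 11, 12, 14},  # molecule 4, scaffold C
]
scaffolds = ["A", "A", "B", "C", "C"]
# Hypothetical similarity-based clusters that merge molecule 2 with its neighbours.
clusters = [0, 0, 0, 1, 1]

scaf_train, scaf_test = group_split(scaffolds, test_fraction=0.2)
clus_train, clus_test = group_split(clusters, test_fraction=0.4)

scaf_sims = nearest_train_similarity(fps, scaf_train, scaf_test)
clus_sims = nearest_train_similarity(fps, clus_train, clus_test)
print("scaffold split, nearest-train similarity:", scaf_sims)
print("cluster split, nearest-train similarity:", clus_sims)
```

In this toy case the scaffold split holds out molecule 2 while its close analogues stay in the training set, whereas the cluster-based split keeps all three together and holds out a group with no overlap. The splitting logic is identical in both calls; only the grouping labels change, which is the factor the abstract varies across scaffold, Butina, and UMAP splits.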

