Virtual Screening (VS) of vast compound libraries guided by Artificial Intelligence (AI) models is a highly productive approach to early drug discovery. How the data are split is crucial for benchmarking such AI models realistically. Traditional random splits place similar molecules in both the training and test sets, which conflicts with the reality of VS libraries, which mostly contain structurally distinct compounds. The scaffold split, which groups molecules by shared core structure, is widely considered to reflect this real-world scenario. However, here we show that the scaffold split also overestimates VS performance. The reason is that molecules with different chemical scaffolds are often similar, so a scaffold split still leaves unrealistically high similarities between training and test molecules. We examined three representative AI models on 60 NCI-60 datasets, each comprising approximately 30,000 to 50,000 molecules tested on a different cancer cell line. Each dataset was split with three methods: scaffold split, Butina clustering, and the more accurate Uniform Manifold Approximation and Projection (UMAP) clustering. Across the 2100 models trained and evaluated for each algorithm and split, performance is much worse with UMAP splits, regardless of the model. These robust results demonstrate the need for more realistic data splits to tune, compare, and select models for VS. For the same reason, we also recommend avoiding the scaffold split in other molecular property prediction problems. The code to reproduce these results is available at https://github.com/ScaffoldSplitsOverestimateVS
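The central argument, that a group-based split can keep groups disjoint and still leave test molecules with close neighbours in the training set, can be illustrated with a minimal sketch. This is not the pipeline from the repository above: the scaffold labels and fingerprints are hypothetical toy values (fingerprints are modelled as plain sets of feature IDs), and the helper names `group_split`, `tanimoto`, and `max_train_similarity` are introduced here for illustration only. In practice the labels would come from a scaffold or clustering routine in a cheminformatics toolkit.

```python
import random
from collections import defaultdict


def group_split(groups, test_fraction=0.2, seed=0):
    """Assign whole groups to test until the target size is reached.

    `groups` holds one label per molecule (e.g. a scaffold or cluster ID),
    so no label ever appears on both sides of the split.
    """
    by_group = defaultdict(list)
    for idx, label in enumerate(groups):
        by_group[label].append(idx)

    labels = sorted(by_group)
    random.Random(seed).shuffle(labels)

    target = test_fraction * len(groups)
    train_idx, test_idx = [], []
    for label in labels:
        if len(test_idx) < target:
            test_idx.extend(by_group[label])
        else:
            train_idx.extend(by_group[label])
    return sorted(train_idx), sorted(test_idx)


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of feature IDs."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def max_train_similarity(fps, train_idx, test_idx):
    """For each test molecule, similarity to its nearest training molecule."""
    return {
        t: max(tanimoto(fps[t], fps[i]) for i in train_idx)
        for t in test_idx
    }


# Hypothetical toy data: three scaffolds, two molecules each.
# Molecules 0 and 2, and molecules 1 and 4, carry different scaffold
# labels but nearly identical fingerprints.
groups = ["A", "A", "B", "B", "C", "C"]
fps = [
    {1, 2, 3, 4},
    {1, 2, 3, 5},
    {1, 2, 3, 4, 6},
    {7, 8, 9},
    {1, 2, 3, 5, 10},
    {11, 12, 13},
]

train_idx, test_idx = group_split(groups, test_fraction=0.2, seed=0)
sims = max_train_similarity(fps, train_idx, test_idx)
for mol, sim in sims.items():
    print(f"test molecule {mol} (scaffold {groups[mol]}): "
          f"max similarity to training set = {sim:.2f}")
```

Whichever scaffold ends up in the test set here, at least one test molecule has a highly similar training neighbour even though no scaffold is shared. Inspecting this kind of nearest-neighbour similarity between training and test sets is one simple way to judge how demanding a given split is.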