Searching large databases for potentially active compounds is a necessary step for reducing time and cost in modern drug discovery pipelines. Virtual screening methods provide predictions that narrow the search space. Although cheminformatics has made great progress in exploiting available big data, care is needed to avoid introducing bias and to ensure that predictions remain useful for new compounds. In this work, we propose the decision-support tool ALMERIA (Advanced Ligand Multiconformational Exploration with Robust Interpretable Artificial Intelligence), which estimates compound similarity and predicts activity from pairwise molecular contrasts while accounting for conformational variability. The methodology covers the entire pipeline, from data preparation to model selection and hyperparameter optimization. It has been implemented with scalable software and methods that exploit large volumes of data (on the order of several terabytes) and offers very fast responses even for large batches of queries. The implementation and experiments were carried out on a distributed computer cluster using the publicly accessible DUD-E database as a benchmark. In addition to cross-validation, detailed data split criteria were used to evaluate the models on different data partitions and assess their true generalization ability on new compounds. Experiments show state-of-the-art performance for molecular activity prediction (ROC AUC: $0.99$, $0.96$, $0.87$), indicating that the chosen data representation and modeling generalize well. The effect of molecular conformations was also evaluated, in terms of both prediction performance and sensitivity analysis. Finally, an interpretability analysis was performed using the SHAP method.
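The evaluation idea mentioned above, scoring a model by ROC AUC on a partition that shares no group with the training data, can be illustrated with a minimal sketch. This is not the ALMERIA implementation: the function names, the `group` field (standing in for whatever criterion defines a partition, such as a protein target), and the toy records are all assumptions made for illustration.

```python
def roc_auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) formulation.

    labels: iterable of 0/1 (1 = active); scores: predicted activity scores.
    Tied scores receive their average rank.
    """
    pairs = sorted(zip(scores, labels))
    ranks = [0.0] * len(pairs)
    i = 0
    while i < len(pairs):
        j = i
        # Extend j over a run of tied scores.
        while j + 1 < len(pairs) and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC AUC needs at least one active and one inactive")

    pos_rank_sum = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def split_by_group(records, held_out_groups):
    """Hold out whole groups so test compounds come from unseen groups.

    `records` is a list of dicts with a hypothetical "group" key.
    """
    train = [r for r in records if r["group"] not in held_out_groups]
    test = [r for r in records if r["group"] in held_out_groups]
    return train, test


# Toy data (made up): two groups, a few compounds each.
records = [
    {"group": "A", "label": 1, "score": 0.9},
    {"group": "A", "label": 0, "score": 0.2},
    {"group": "B", "label": 0, "score": 0.1},
    {"group": "B", "label": 0, "score": 0.4},
    {"group": "B", "label": 1, "score": 0.35},
    {"group": "B", "label": 1, "score": 0.8},
]
train, test = split_by_group(records, held_out_groups={"B"})
test_auc = roc_auc([r["label"] for r in test], [r["score"] for r in test])
```

Holding out entire groups, rather than random rows, is one common way to make a test partition resemble genuinely new compounds; a lower AUC on such a partition than under random cross-validation would be expected.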