DOCKSTRING: easy molecular docking yields better benchmarks for ligand design

The field of machine learning for drug discovery is witnessing an explosion of novel methods. These methods are often benchmarked on simple physicochemical properties such as solubility or general druglikeness, which can be readily computed. However, these properties are poor representatives of objective functions in drug design, mainly because they do not depend on the candidate's interaction with the target. By contrast, molecular docking is a widely successful method for estimating binding affinities in drug discovery. Docking simulations, however, require a significant amount of domain knowledge to set up correctly, which hampers adoption. To address this, we present DOCKSTRING, a bundle for meaningful and robust comparison of machine learning (ML) models consisting of three components: (1) an open-source Python package for straightforward computation of docking scores; (2) an extensive dataset of docking scores and poses of more than 260K ligands for 58 medically relevant targets; and (3) a set of pharmaceutically relevant benchmark tasks including regression, virtual screening, and de novo design. The Python package implements a robust ligand and target preparation protocol that allows non-experts to obtain meaningful docking scores. Our dataset is the first to include docking poses, as well as the first of its size that is a full matrix (every ligand is docked against every target), thus facilitating experiments in multiobjective optimization and transfer learning. Overall, our results indicate that docking scores are a more appropriate evaluation objective than simple physicochemical properties, yielding more realistic benchmark tasks and molecular candidates.
