Forming a molecular candidate set that contains a wide range of potentially effective compounds is crucial to the success of drug discovery. While most databases and machine-learning-based generation models aim to optimize particular chemical properties, there is limited literature on how to properly measure the coverage of the chemical space by the candidates they include or generate. This problem is challenging because there are no formal criteria for selecting good measures of the chemical space. In this paper, we propose a novel evaluation framework for measures of the chemical space based on two analyses: an axiomatic analysis with three intuitive axioms that a good measure should obey, and an empirical analysis of the correlation between a measure and a proxy gold standard. Using this framework, we identify #Circles, a new measure of chemical space coverage that is superior to existing measures both analytically and empirically. We further evaluate how well existing databases and generation models cover the chemical space in terms of #Circles. The results suggest that many generation models fail to explore a larger space than existing databases, which opens new opportunities for improving generation models by encouraging exploration.