Molecular property optimization (MPO) problems are inherently challenging because they are formulated over discrete, unstructured spaces and because labeling requires expensive simulations or experiments, which fundamentally limits the amount of available data. Bayesian optimization (BO) is a powerful and popular framework for efficient optimization of noisy, black-box objective functions (e.g., measured property values) and is therefore a potentially attractive framework for MPO. To apply BO to MPO problems, one must select a structured molecular representation that enables construction of a probabilistic surrogate model. Many molecular representations have been developed; however, they are all high-dimensional, which introduces important challenges in the BO process, mainly because the curse of dimensionality makes it difficult to define and perform inference over a suitable class of surrogate models. This challenge has recently been addressed by learning a lower-dimensional encoding of a SMILES or graph representation of a molecule in an unsupervised manner and then performing BO in the encoded space. In this work, we show that such methods have a tendency to "get stuck," which we hypothesize occurs because the mapping from the encoded space to property values is not necessarily well-modeled by a Gaussian process. We argue for an alternative approach that combines numerical molecular descriptors with a sparse axis-aligned Gaussian process model, which is capable of rapidly identifying the sparse subspaces that are most relevant to modeling the unknown property function. We demonstrate that our proposed method substantially outperforms existing MPO methods on a variety of benchmark and real-world problems. Specifically, we show that our method can routinely find near-optimal molecules out of a set of more than 100k alternatives within 100 or fewer expensive queries.
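To make the phrase "sparse axis-aligned Gaussian process" concrete, the block below sketches one common way such a model can be written down. It is an assumption for illustration, not a specification taken from the abstract: it uses a squared-exponential kernel with one inverse squared length scale $\rho_d$ per descriptor and a hierarchical half-Cauchy ($\mathcal{HC}$) shrinkage prior. The symbols $D$, $\sigma_k^2$, $\rho_d$, $\tau$, $\alpha$, and $\sigma^2$ are introduced here and do not appear in the original text.

```latex
% Illustrative sparsity-inducing GP prior (assumed form; not stated in the abstract).
% x, x' \in \mathbb{R}^D are numerical descriptor vectors of two molecules.
% \rho_d is the inverse squared length scale of descriptor d; \tau is a global shrinkage scale.
\begin{align*}
k(x, x') &= \sigma_k^2 \exp\!\Big(-\tfrac{1}{2}\sum_{d=1}^{D} \rho_d\,(x_d - x'_d)^2\Big), \\
\tau &\sim \mathcal{HC}(\alpha), \\
\rho_d \mid \tau &\sim \mathcal{HC}(\tau), \qquad d = 1, \dots, D, \\
f &\sim \mathcal{GP}(0, k), \qquad y = f(x) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2).
\end{align*}
```

Under a prior of this kind, a small $\alpha$ pulls most $\rho_d$ toward zero, so the kernel effectively ignores the corresponding descriptors, while the heavy tails of the half-Cauchy allow a few $\rho_d$ to remain large once the data support them. This is one mechanism by which a surrogate can identify a low-dimensional, axis-aligned subspace from only a small number of expensive queries.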