In this paper, we formulate a new multi-objective coverage (MOC) problem, in which the goal is to identify a small set of representative samples whose predicted outcomes broadly cover the feasible multi-objective space. The problem matters in critical real-world applications such as drug discovery and materials design: a representative set can be evaluated much faster than the whole feasible set, which significantly accelerates scientific discovery. Existing methods cannot be applied directly, because they target either coverage of the sample space or multi-objective optimization of the Pareto front. Neither fits MOC: chemically diverse samples often yield identical objective profiles, so covering the sample space does not imply covering the objective space, and safety constraints are usually defined on the objectives. To solve the MOC problem, we propose a novel search algorithm, MOC-CAS, which uses an upper confidence bound (UCB)-based acquisition function to select optimistic samples guided by Gaussian process posterior predictions. To enable efficient optimization, we develop a smoothed relaxation of the hard feasibility test and derive an approximate optimizer. Empirically, MOC-CAS outperforms competitive baselines on large-scale protein-target datasets for SARS-CoV-2 and cancer, each assessed on five objectives derived from SMILES-based features.
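The ingredients named above (optimistic UCB predictions, a smoothed feasibility test, and selection for coverage) can be illustrated with a minimal sketch. This is not the MOC-CAS algorithm or its approximate optimizer. The threshold-style feasibility test, the sigmoid relaxation with a `temperature` parameter, and the greedy farthest-point heuristic are all assumptions made for illustration, as are the function names. The arrays `mu` and `sigma` stand in for Gaussian process posterior means and standard deviations, each of shape (number of samples, number of objectives).

```python
import numpy as np


def ucb(mu, sigma, beta=2.0):
    """Optimistic outcome estimate: posterior mean plus beta posterior std."""
    return mu + beta * sigma


def soft_feasibility(y, thresholds, temperature=0.1):
    """Sigmoid relaxation of the hard test all(y >= thresholds).

    Assumption: feasibility is a per-objective lower bound. The abstract
    does not specify the actual test or its relaxation.
    """
    z = np.clip((y - thresholds) / temperature, -50.0, 50.0)
    return np.prod(1.0 / (1.0 + np.exp(-z)), axis=1)


def greedy_cover(mu, sigma, thresholds, k, beta=2.0, temperature=0.1):
    """Pick k samples whose optimistic outcomes spread over the feasible region.

    Hypothetical heuristic: farthest-point selection in objective space,
    with each candidate's distance weighted by its soft feasibility.
    """
    y = ucb(mu, sigma, beta)
    w = soft_feasibility(y, thresholds, temperature)
    selected = [int(np.argmax(w))]  # start from the most feasible sample
    dist = np.linalg.norm(y - y[selected[0]], axis=1)
    for _ in range(k - 1):
        score = w * dist
        score[selected] = -np.inf  # never pick the same sample twice
        nxt = int(np.argmax(score))
        selected.append(nxt)
        # distance from each sample to its nearest selected sample
        dist = np.minimum(dist, np.linalg.norm(y - y[nxt], axis=1))
    return selected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mu = rng.normal(size=(200, 5))          # five objectives, as in the paper
    sigma = rng.uniform(0.05, 0.3, size=(200, 5))
    print(greedy_cover(mu, sigma, thresholds=np.full(5, -1.0), k=10))
```

The soft feasibility weight keeps the score continuous in the predicted outcomes, and that continuity is the usual reason for replacing a hard indicator with a smooth surrogate. Lower temperatures bring the weight closer to the hard test.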