Machine learning (ML) enables fast and accurate molecular property prediction, which is of interest in drug discovery and material design. Its success rests on the similarity principle, which assumes that similar molecules exhibit similar properties. However, activity cliffs violate this principle, and their presence causes a sharp decline in the performance of existing ML algorithms, particularly graph-based methods. To overcome this obstacle in a low-data scenario, we propose a novel semi-supervised learning (SSL) method, dubbed SemiMol, which uses predictions on large amounts of unannotated data as pseudo-signals for subsequent training. Specifically, because existing pseudo-labeling approaches rely on probabilistic outputs to reveal the model's confidence and therefore cannot be applied to regression tasks, we introduce an additional instructor model to evaluate the accuracy and trustworthiness of proxy labels. Moreover, we design a self-adaptive curriculum learning algorithm that progressively moves the target model toward hard samples at a controllable pace. Extensive experiments on 30 activity cliff datasets demonstrate that SemiMol significantly enhances graph-based ML architectures and outperforms state-of-the-art pretraining and SSL baselines.
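To make the pseudo-labeling idea concrete, the following is a minimal sketch of how instructor-scored proxy labels might be filtered under a curriculum. It is not the SemiMol implementation: the function names `curriculum_threshold` and `select_pseudo_labels`, the linear schedule, the threshold values, and the use of plain numbers in place of an instructor model's trust scores are all assumptions made for illustration. The abstract does not specify how the self-adaptive pace is actually computed.

```python
from typing import List, Tuple


def curriculum_threshold(step: int, total_steps: int,
                         start: float = 0.9, end: float = 0.5) -> float:
    """Relax the trust threshold linearly from `start` to `end`.

    Assumption: a fixed linear schedule stands in for the self-adaptive
    pace described in the text, which is not specified there.
    """
    if total_steps <= 1:
        return end
    # Clamp progress to [0, 1] so out-of-range steps stay well defined.
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + frac * (end - start)


def select_pseudo_labels(predictions: List[float],
                         trust_scores: List[float],
                         threshold: float) -> List[Tuple[int, float]]:
    """Return (index, pseudo_label) pairs whose trust score meets the threshold.

    Assumption: `trust_scores` would come from an instructor model that rates
    each regression prediction; here they are supplied directly as numbers.
    """
    if len(predictions) != len(trust_scores):
        raise ValueError("predictions and trust_scores must have equal length")
    return [
        (i, y)
        for i, (y, s) in enumerate(zip(predictions, trust_scores))
        if s >= threshold
    ]


# Hypothetical target-model predictions on unlabeled molecules,
# with hypothetical instructor trust scores in [0, 1].
preds = [6.1, 7.4, 5.0, 8.2]
trust = [0.95, 0.60, 0.40, 0.85]

early = select_pseudo_labels(preds, trust, curriculum_threshold(0, 3))
late = select_pseudo_labels(preds, trust, curriculum_threshold(2, 3))
```

In this sketch, early rounds admit only the most trusted proxy labels, and later rounds lower the bar so that harder, less certain samples enter training. A real system would retrain the target model between rounds and refresh both predictions and trust scores.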