Deep learning has achieved remarkable success in learning molecular representations, which are crucial for a variety of biochemical applications ranging from property prediction to drug design. However, training Deep Neural Networks (DNNs) from scratch often requires large numbers of labeled molecules, which are expensive to acquire in practice. To alleviate this issue, considerable effort has been devoted to Molecular Pre-trained Models (MPMs), in which DNNs are pre-trained on large-scale unlabeled molecular databases and then fine-tuned on specific downstream tasks. Despite this rapid progress, a systematic review of the field is still lacking. In this paper, we present the first survey that summarizes the current progress of MPMs. We first highlight the limitations of training molecular representation models from scratch to motivate MPM studies. Next, we systematically review recent advances on this topic from several key perspectives, including molecular descriptors, encoder architectures, pre-training strategies, and applications. Finally, we highlight open challenges and promising avenues for future research, providing a useful resource for both the machine learning and scientific communities.