Pretraining molecular representations on large unlabeled datasets is essential for molecular property prediction because ground-truth labels are costly to obtain. Although various 2D graph-based molecular pretraining approaches exist, they struggle to show statistically significant gains in predictive performance. Recent work has therefore proposed 3D conformer-based pretraining with a denoising task, which has led to promising results. During downstream finetuning, however, models trained with 3D conformers require accurate atomic coordinates for previously unseen molecules, which are computationally expensive to acquire at scale. In light of this limitation, we propose D&D, a self-supervised molecular representation learning framework that pretrains a 2D graph encoder by distilling representations from a 3D denoiser. By following denoising with cross-modal knowledge distillation, our approach retains the knowledge obtained from denoising while applying readily to downstream tasks where accurate conformers are unavailable. Experiments on real-world molecular property prediction datasets show that a graph encoder trained via D&D can infer 3D information from the 2D graph and achieves better performance and label efficiency than other baselines.
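The two training signals described above, denoising on 3D conformers followed by distillation into a 2D encoder, can be illustrated with a minimal sketch. The abstract does not give the loss functions, so everything here is an assumption made for illustration: Gaussian coordinate noise, a noise-prediction objective, mean squared error for both stages, and the hypothetical helper names `perturb`, `denoising_loss`, and `distillation_loss`. It is not the authors' implementation.

```python
import random


def perturb(coords, sigma, rng):
    """Add Gaussian noise to every coordinate; return (noisy coords, noise)."""
    noise = [[rng.gauss(0.0, sigma) for _ in atom] for atom in coords]
    noisy = [[c + n for c, n in zip(atom, eps)] for atom, eps in zip(coords, noise)]
    return noisy, noise


def mse(pred, target):
    """Mean squared error over two equally shaped nested lists."""
    diffs = [(p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)


def denoising_loss(predicted_noise, noise):
    # Stage 1 (assumed form): the 3D model predicts the noise added to the conformer.
    return mse(predicted_noise, noise)


def distillation_loss(student_repr, teacher_repr):
    # Stage 2 (assumed form): the 2D encoder's output is pulled toward the
    # representation produced by the pretrained 3D denoiser.
    return mse(student_repr, teacher_repr)


rng = random.Random(0)
# Toy conformer with three atoms (coordinates are made up for illustration).
coords = [[0.0, 0.0, 0.0], [1.1, 0.0, 0.0], [1.6, 1.0, 0.0]]
noisy, noise = perturb(coords, sigma=0.1, rng=rng)

perfect = denoising_loss(noise, noise)  # a perfect noise prediction gives zero loss
zero_guess = denoising_loss([[0.0] * 3 for _ in coords], noise)  # predicting "no noise"
```

In this sketch only the first stage touches 3D coordinates. The distillation loss compares representations, so at finetuning time the 2D encoder can be used without conformers, which is the property the abstract emphasizes.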