Artificial intelligence for drug discovery has recently attracted growing interest in both the machine learning and chemistry communities. Molecular geometry is a fundamental building block of drug discovery, so the geometric representation of molecules is the main bottleneck in applying machine learning techniques to it more effectively. In this work, we propose a pretraining method for molecule joint auto-encoding (MoleculeJAE). MoleculeJAE learns both 2D bond (topology) and 3D conformation (geometry) information. A diffusion process model is applied to mimic the augmented trajectories of these two modalities, and from these trajectories MoleculeJAE learns the inherent chemical structure in a self-supervised manner. The pretrained geometric representation in MoleculeJAE is therefore expected to benefit downstream geometry-related tasks. Empirically, MoleculeJAE achieves state-of-the-art performance on 15 out of 20 tasks when compared against 12 competitive baselines.