Coordinate denoising is a promising 3D molecular pre-training method that has achieved remarkable performance on various downstream drug discovery tasks. Theoretically, its objective is equivalent to learning the force field, which has been shown to benefit downstream tasks. Nevertheless, coordinate denoising faces two challenges in learning an effective force field: low-coverage samples and an isotropic force field. The underlying reason is that the molecular distributions assumed by existing denoising methods fail to capture the anisotropic characteristics of molecules. To tackle these challenges, we propose a novel hybrid noise strategy that adds noise to both dihedral angles and coordinates. However, denoising such hybrid noise in the traditional way is no longer equivalent to learning the force field. Through theoretical deduction, we find that the problem is caused by the dependence of the covariance on the input conformation. We therefore propose to decouple the two types of noise and design a novel fractional denoising method (Frad), which denoises only the latter, coordinate part. In this way, Frad enjoys both the merit of sampling more low-energy structures and the force field equivalence. Extensive experiments show the effectiveness of Frad in molecular representation, with a new state of the art on 9 out of 12 tasks of QM9 and on 7 out of 8 targets of MD17.
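As an illustration of the hybrid noise strategy and the fractional denoising target described above, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single rotatable bond, a hand-specified set of atoms that move with it, and hypothetical noise scales (`sigma_dihedral` in radians, `sigma_coord` in the same length unit as the coordinates). The function name `hybrid_noise_sample` and the toy four-atom geometry are likewise invented for the example.

```python
import numpy as np


def rotate_about_axis(points, origin, axis, angle):
    """Rotate points about the line through `origin` along `axis` (Rodrigues' formula)."""
    k = axis / np.linalg.norm(axis)
    p = points - origin
    rotated = (
        p * np.cos(angle)
        + np.cross(k, p) * np.sin(angle)
        + np.outer(p @ k, k) * (1.0 - np.cos(angle))
    )
    return rotated + origin


def hybrid_noise_sample(coords, bond, moving_atoms, sigma_dihedral, sigma_coord, rng):
    """Apply dihedral noise, then coordinate noise, for one rotatable bond.

    Returns (noisy_coords, coord_noise, perturbed_coords). Only `coord_noise`
    would serve as the regression target, mirroring the idea of denoising
    only the coordinate part. All parameter values are illustrative assumptions.
    """
    i, j = bond
    # Step 1: dihedral noise, i.e. rotate the atoms on one side of the bond.
    angle = rng.normal(0.0, sigma_dihedral)
    perturbed = coords.copy()
    perturbed[moving_atoms] = rotate_about_axis(
        coords[moving_atoms], coords[j], coords[j] - coords[i], angle
    )
    # Step 2: isotropic Gaussian coordinate noise on every atom.
    coord_noise = rng.normal(0.0, sigma_coord, size=coords.shape)
    return perturbed + coord_noise, coord_noise, perturbed


# Toy four-atom chain; the geometry is made up for the example.
rng = np.random.default_rng(0)
coords = np.array(
    [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.0, 1.4, 0.0], [3.5, 1.4, 0.5]]
)
noisy, target, perturbed = hybrid_noise_sample(
    coords, bond=(1, 2), moving_atoms=[3],
    sigma_dihedral=2.0, sigma_coord=0.04, rng=rng,
)
```

In a pre-training loop under these assumptions, a model would receive `noisy` as input and be trained to predict `target`; the dihedral perturbation changes which conformation is sampled but is not itself regressed.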