Methods of computational quantum chemistry provide accurate approximations of molecular properties crucial for computer-aided drug discovery and other areas of chemical science. However, their high computational complexity limits the scalability of their applications. Neural network potentials (NNPs) are a promising alternative to quantum chemistry methods, but they require large and diverse datasets for training. This work presents a new dataset and benchmark called $\nabla^2$DFT, which is based on nablaDFT. Compared with nablaDFT, it contains twice as many molecular structures and three times as many conformations, and it adds new data types, new tasks, and state-of-the-art models. The dataset includes energies, forces, 17 molecular properties, Hamiltonian and overlap matrices, and a wavefunction object. All calculations were performed at the DFT level ($\omega$B97X-D/def2-SVP) for each conformation. Moreover, $\nabla^2$DFT is the first dataset that contains relaxation trajectories for a substantial number of drug-like molecules. We also introduce a novel benchmark for evaluating NNPs in molecular property prediction, Hamiltonian prediction, and conformational optimization tasks. Finally, we propose an extensible framework for training NNPs and implement 10 models within it.