$\nabla^2$DFT: A Universal Quantum Chemistry Dataset of Drug-Like Molecules and a Benchmark for Neural Network Potentials

Kuzma Khrabrov, Anton Ber, Artem Tsypin, Konstantin Ushenin, Egor Rumiantsev, Alexander Telepov, Dmitry Protasov, Ilya Shenbin, Anton Alekseev, Mikhail Shirokikh, Sergey Nikolenko, Elena Tutubalina, Artur Kadurin

Methods of computational quantum chemistry provide accurate approximations of molecular properties crucial for computer-aided drug discovery and other areas of chemical science. However, their high computational complexity limits the scalability of their applications. Neural network potentials (NNPs) are a promising alternative to quantum chemistry methods, but they require large and diverse datasets for training. This work presents $\nabla^2$DFT, a new dataset and benchmark based on nablaDFT. It contains twice as many molecular structures and three times as many conformations as nablaDFT, along with new data types and tasks and state-of-the-art models. The dataset includes energies, forces, 17 molecular properties, Hamiltonian and overlap matrices, and a wavefunction object. All calculations were performed at the DFT level ($\omega$B97X-D/def2-SVP) for each conformation. Moreover, $\nabla^2$DFT is the first dataset that contains relaxation trajectories for a substantial number of drug-like molecules. We also introduce a novel benchmark for evaluating NNPs in molecular property prediction, Hamiltonian prediction, and conformational optimization tasks. Finally, we propose an extendable framework for training NNPs and implement 10 models within it.

