A significant challenge in computational chemistry is developing approximations that accelerate \emph{ab initio} methods while preserving accuracy. Machine learning interatomic potentials (MLIPs) have emerged as a promising way to construct atomistic potentials that transfer across different molecular and crystalline systems. Most MLIPs are trained only on energies and forces computed in vacuum, although including the curvature of the potential energy surface, that is, its second derivatives (the Hessian), could improve the description of that surface. We present Hessian QM9, the first database of equilibrium configurations and numerical Hessian matrices, comprising 41,645 molecules from the QM9 dataset computed at the $\omega$B97x/6-31G* level. Molecular Hessians were calculated in vacuum and, using an implicit solvation model, in water, tetrahydrofuran, and toluene. To demonstrate the utility of the dataset, we show that incorporating second derivatives of the potential energy surface into the loss function of an MLIP significantly improves the prediction of vibrational frequencies in all solvent environments, making the dataset well suited to studying organic molecules in the realistic solvent environments used for experimental characterization.
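The idea of adding second derivatives to an MLIP loss function can be illustrated with a minimal sketch. This is not the authors' training code: it assumes a toy one-dimensional harmonic bond in place of a real potential, uses central finite differences where an actual MLIP would typically use automatic differentiation, and all function names and weights (`harmonic_bond`, `efh_loss`, `w_e`, `w_f`, `w_h`) are hypothetical.

```python
import numpy as np


def harmonic_bond(k, r0):
    """Return E(x) for a toy 1D diatomic with coordinates x = [x1, x2]."""
    def energy(x):
        r = x[1] - x[0]
        return 0.5 * k * (r - r0) ** 2
    return energy


def gradient(energy, x, h=1e-4):
    """Central-difference gradient dE/dx_i."""
    x = np.asarray(x, dtype=float)
    g = np.zeros(x.size)
    for i in range(x.size):
        step = np.zeros(x.size)
        step[i] = h
        g[i] = (energy(x + step) - energy(x - step)) / (2.0 * h)
    return g


def hessian(energy, x, h=1e-3):
    """Central-difference Hessian d2E/(dx_i dx_j)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    hess = np.zeros((n, n))
    for i in range(n):
        ei = np.zeros(n)
        ei[i] = h
        for j in range(n):
            ej = np.zeros(n)
            ej[j] = h
            hess[i, j] = (
                energy(x + ei + ej) - energy(x + ei - ej)
                - energy(x - ei + ej) + energy(x - ei - ej)
            ) / (4.0 * h * h)
    return hess


def efh_loss(model, reference, x, w_e=1.0, w_f=1.0, w_h=1.0):
    """Weighted sum of squared errors in energy, forces, and Hessian.

    The weights are illustrative assumptions, not values from the paper.
    """
    x = np.asarray(x, dtype=float)
    d_e = model(x) - reference(x)
    # Forces are the negative gradient of the energy.
    d_f = -gradient(model, x) + gradient(reference, x)
    d_h = hessian(model, x) - hessian(reference, x)
    return w_e * d_e ** 2 + w_f * np.mean(d_f ** 2) + w_h * np.mean(d_h ** 2)


if __name__ == "__main__":
    reference = harmonic_bond(k=2.0, r0=1.0)  # stand-in for the ab initio surface
    model = harmonic_bond(k=1.0, r0=1.0)      # stand-in for an MLIP that is too soft
    x_eq = [0.0, 1.0]                         # equilibrium geometry of both potentials
    print("energy + force loss:", efh_loss(model, reference, x_eq, w_h=0.0))
    print("with Hessian term:  ", efh_loss(model, reference, x_eq))
```

In this toy setting both potentials share the same equilibrium geometry, so their energies and forces agree there and only the Hessian term separates the two models. That is the sense in which curvature data adds information at equilibrium configurations.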