Progress in both Machine Learning (ML) and Quantum Chemistry (QC) methods has resulted in high-accuracy ML models for QC properties. Datasets such as MD17 and WS22 have been used to benchmark these models at a given level of QC method, or fidelity, a term that refers to the accuracy of the chosen QC method. Multifidelity ML (MFML) methods, in which models are trained on data from more than one fidelity, have been shown to be more effective than single-fidelity methods. Research in this direction is active across diverse applications, ranging from energy band gaps to excitation energies. One hurdle for effective research here is the lack of a diverse multifidelity dataset for benchmarking. We provide the quantum Chemistry MultiFidelity (CheMFi) dataset, consisting of five fidelities calculated with the TD-DFT formalism. The fidelities differ in their basis set choice: STO-3G, 3-21G, 6-31G, def2-SVP, and def2-TZVP. CheMFi offers the community a variety of QC properties, such as vertical excitation properties and molecular dipole moments, and also includes QC computation times, allowing for a time-benefit benchmark of multifidelity models for ML-QC.