Multifidelity machine learning (MFML) for quantum chemical (QC) properties has seen strong development in recent years. The method has been shown to reduce the cost of generating training data for high-accuracy, low-cost ML models. In such a setup, the ML models are trained on molecular geometries and some property of interest computed at several levels of computational chemistry accuracy, or fidelities. These are then combined in training the MFML models. Some multifidelity models require the training data to be nested, that is, the molecular geometries used at a higher fidelity must also be included in the training data of every lower fidelity. This nesting requirement restricts the kind of sampling that can be performed when selecting training samples at different fidelities. This work assesses the use of non-nested training data for two of these multifidelity methods, namely MFML and optimized MFML (o-MFML). The assessment is carried out for the prediction of ground state energies and first vertical excitation energies of a diverse collection of molecules from the CheMFi dataset. Results indicate that the MFML method still requires a nested structure of training data across the fidelities. However, the o-MFML method shows promising results for non-nested multifidelity training data, with model errors comparable to those of the nested configurations.
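The difference between nested and non-nested training data can be illustrated with a minimal sketch of how training indices might be drawn at each fidelity. This is not the sampling procedure of the work itself: the function names, the pool size, and the per-fidelity training set sizes are hypothetical and chosen only for illustration. Fidelities are assumed to be ordered from lowest to highest, with fewer samples at higher fidelities.

```python
import numpy as np


def sample_nested(n_pool, sizes, rng):
    """Draw index sets so that each higher fidelity is a subset of the one below.

    `sizes` is ordered from lowest to highest fidelity and must be non-increasing.
    """
    perm = rng.permutation(n_pool)
    # Taking prefixes of one shared permutation guarantees nesting.
    return [perm[:n] for n in sizes]


def sample_non_nested(n_pool, sizes, rng):
    """Draw the index set of each fidelity independently (assumed scheme)."""
    return [rng.choice(n_pool, size=n, replace=False) for n in sizes]


def is_nested(index_sets):
    """Check that every higher-fidelity set is contained in the next lower one."""
    return all(
        set(high.tolist()) <= set(low.tolist())
        for low, high in zip(index_sets, index_sets[1:])
    )


rng = np.random.default_rng(0)
n_pool = 1000          # hypothetical number of available geometries
sizes = [64, 32, 16]   # hypothetical training set sizes, lowest to highest fidelity

nested = sample_nested(n_pool, sizes, rng)
non_nested = sample_non_nested(n_pool, sizes, rng)

print("nested sampling is nested:", is_nested(nested))
print("independent sampling is nested:", is_nested(non_nested))
```

In the nested case, every geometry computed at the highest fidelity is also computed at all lower fidelities. Independent draws almost never satisfy this condition, which is the situation the non-nested assessment addresses.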