Hamiltonian prediction is a versatile formulation for applying machine learning to molecular science problems. Yet its applicability is limited by the scarcity of labeled training data. In this work, we highlight that Hamiltonian prediction possesses a self-consistency principle, based on which we propose an exact training method that does not require labeled data. This addresses the data scarcity problem and distinguishes the task from other property prediction formulations, with two distinctive benefits: (1) self-consistency training enables the model to be trained on a large amount of unlabeled data, and therefore substantially improves generalization; (2) self-consistency training is more efficient than labeling data with density functional theory (DFT) for supervised training, since it amortizes DFT calculation over a set of molecular structures. We empirically demonstrate improved generalization in data-scarce and out-of-distribution scenarios, and improved efficiency from the amortization. These benefits extend the applicability of Hamiltonian prediction to ever larger scales.
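To make the self-consistency principle concrete, the following is a minimal sketch of a label-free residual, not the implementation used in this work. It assumes a closed-shell system, and `toy_hamiltonian` is a hypothetical stand-in for the DFT Hamiltonian construction (a real one would evaluate Coulomb and exchange-correlation terms from the density). All function and parameter names are illustrative assumptions. The idea shown is that a predicted Hamiltonian is scored against the Hamiltonian rebuilt from its own eigen-solution, so no reference label enters the loss.

```python
import numpy as np


def occupied_density(H, S, n_occ):
    """Solve H C = S C eps and return the closed-shell density matrix."""
    # Loewdin orthogonalization: X = S^{-1/2} turns the generalized
    # eigenproblem into an ordinary symmetric one.
    s_val, s_vec = np.linalg.eigh(S)
    X = s_vec @ np.diag(s_val ** -0.5) @ s_vec.T
    _, C_ortho = np.linalg.eigh(X.T @ H @ X)
    # Keep the n_occ lowest orbitals, each doubly occupied.
    C_occ = X @ C_ortho[:, :n_occ]
    return 2.0 * C_occ @ C_occ.T


def toy_hamiltonian(P, H_core, coupling):
    """Hypothetical density-dependent Hamiltonian (NOT a real DFT build)."""
    return H_core + coupling * np.diag(np.diag(P))


def self_consistency_loss(H_pred, S, H_core, coupling, n_occ):
    """Squared Frobenius residual between H_pred and its own rebuild."""
    P = occupied_density(H_pred, S, n_occ)
    H_rebuilt = toy_hamiltonian(P, H_core, coupling)
    return float(np.sum((H_pred - H_rebuilt) ** 2))
```

In this sketch the residual vanishes exactly when the predicted Hamiltonian is a fixed point of the rebuild, which is the condition a converged self-consistent calculation satisfies. Training a model on it would additionally require differentiating the residual with respect to the model parameters, which the sketch does not cover.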