Data generation remains a bottleneck in training surrogate models to predict molecular properties. We demonstrate that multitask Gaussian process regression overcomes this limitation by leveraging both expensive and cheap data sources. In particular, we consider training sets constructed from coupled-cluster (CC) and density functional theory (DFT) data. We report that multitask surrogates can predict at CC-level accuracy while reducing data generation cost by over an order of magnitude. Of note, our approach allows the training set to include DFT data generated by a heterogeneous mix of exchange-correlation functionals without imposing any artificial hierarchy on functional accuracy. More generally, the multitask framework can accommodate a wider range of training set structures -- including full disparity between the different levels of fidelity -- than existing kernel approaches based on $\Delta$-learning, though we show that the accuracy of the two approaches can be similar. Consequently, multitask regression can be a tool for reducing data generation costs even further by opportunistically exploiting existing data sources.
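To make the idea of combining cheap and expensive data concrete, the following is a minimal sketch of multitask Gaussian process regression on a toy one-dimensional problem. It is not the model or data used in this work: the abstract does not specify a kernel, so the sketch assumes one common construction, the intrinsic coregionalization model, in which a task covariance matrix `B` multiplies a squared-exponential input kernel. The functions `cheap` and `expensive`, the entries of `B`, the length scale, and the noise level are all hypothetical choices made for illustration. Note that the cheap and expensive inputs need not coincide, which is the "full disparity" setting mentioned above.

```python
import numpy as np


def rbf(xa, xb, length_scale=0.2):
    """Squared-exponential kernel between two 1-D input arrays."""
    d = xa[:, None] - xb[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)


def multitask_kernel(xa, ta, xb, tb, B):
    """Assumed ICM kernel: K[(x, t), (x', t')] = B[t, t'] * k(x, x')."""
    return B[np.ix_(ta, tb)] * rbf(xa, xb)


def gp_mean(K_train, y_train, K_cross, noise=1e-4):
    """GP posterior mean with zero prior mean and i.i.d. Gaussian noise."""
    A = K_train + noise * np.eye(len(y_train))
    return K_cross @ np.linalg.solve(A, y_train)


# Hypothetical stand-ins for a cheap (DFT-like) and an expensive (CC-like) target.
def cheap(x):
    return np.sin(2.0 * np.pi * x)


def expensive(x):
    return cheap(x) + 0.2 * x


x_cheap = np.linspace(0.0, 1.0, 20)   # many cheap samples (task 0)
x_exp = np.array([0.1, 0.5, 0.9])     # few expensive samples (task 1)
x_test = np.linspace(0.0, 1.0, 101)

# Assumed task covariance: the two tasks are strongly but not perfectly correlated.
B = np.array([[1.0, 0.9],
              [0.9, 1.0]])

# Stack both data sources into one training set, labelling each point by task.
x_train = np.concatenate([x_cheap, x_exp])
t_train = np.concatenate([np.zeros(len(x_cheap), dtype=int),
                          np.ones(len(x_exp), dtype=int)])
y_train = np.concatenate([cheap(x_cheap), expensive(x_exp)])
t_test = np.ones(len(x_test), dtype=int)  # predict the expensive task

# Multitask prediction uses both data sources.
mt_pred = gp_mean(
    multitask_kernel(x_train, t_train, x_train, t_train, B),
    y_train,
    multitask_kernel(x_test, t_test, x_train, t_train, B),
)

# Single-task baseline sees only the three expensive samples.
st_pred = gp_mean(rbf(x_exp, x_exp), expensive(x_exp), rbf(x_test, x_exp))


def rmse(pred):
    """Root-mean-square error against the expensive target on the test grid."""
    return float(np.sqrt(np.mean((pred - expensive(x_test)) ** 2)))


mt_rmse, st_rmse = rmse(mt_pred), rmse(st_pred)
print(f"multitask RMSE:   {mt_rmse:.3f}")
print(f"single-task RMSE: {st_rmse:.3f}")
```

In this toy setting the cheap samples constrain the shape of the function, so the three expensive samples only need to correct a small residual. In practice the task covariance and kernel hyperparameters would be learned from data rather than fixed by hand as they are here.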