The past few years have seen the development of ``universal'' machine-learning interatomic potentials (uMLIPs) capable of approximating the ground-state potential energy surface across a wide range of chemical structures and compositions with reasonable accuracy. While these models differ in architecture and training dataset, they share the ability to compress a staggering amount of chemical information into descriptive latent features. Herein, we systematically analyze what the different uMLIPs have learned by quantitatively assessing the relative information content of their latent features through feature reconstruction errors, and by observing how the trends are affected by the choice of training set and training protocol. We find that uMLIPs encode the chemical space in significantly distinct ways, with substantial cross-model feature reconstruction errors. When variants of the same model architecture are considered, the trends depend on the choice of dataset, target, and training protocol. We also observe that fine-tuning a uMLIP retains a strong pre-training bias in its latent features. Finally, we discuss how atom-level features, which are directly output by MLIPs, can be compressed into global structure-level features by concatenating progressive cumulants, each adding significant new information about the variability across the atomic environments within a given system.
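To make the cross-model comparison concrete, the following is a minimal sketch of one way a feature reconstruction error between two uMLIPs could be computed: a regularized linear map is fitted from one model's latent features to another's, and the normalized residual on held-out samples quantifies how much of the second model's information is recoverable from the first. The feature matrices `X_mace` and `X_m3gnet`, as well as the ridge regularization and train/test split, are illustrative assumptions and not necessarily the exact protocol used in this work.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def feature_reconstruction_error(X_a, X_b, alpha=1e-6, seed=0):
    """Regress model B's latent features from model A's with a linear map
    and return the normalized residual error on a held-out split.
    A low error means A's latent space contains the information needed
    to rebuild B's features; a high error signals missing information."""
    Xa_tr, Xa_te, Xb_tr, Xb_te = train_test_split(
        X_a, X_b, test_size=0.25, random_state=seed
    )
    scale_a = StandardScaler().fit(Xa_tr)
    scale_b = StandardScaler().fit(Xb_tr)
    reg = Ridge(alpha=alpha).fit(scale_a.transform(Xa_tr), scale_b.transform(Xb_tr))
    resid = scale_b.transform(Xb_te) - reg.predict(scale_a.transform(Xa_te))
    return np.sqrt((resid ** 2).sum() / (scale_b.transform(Xb_te) ** 2).sum())

# Hypothetical usage: X_mace and X_m3gnet are (n_environments, n_features)
# latent-feature matrices extracted from two uMLIPs on the same structures.
# err_ab = feature_reconstruction_error(X_mace, X_m3gnet)
```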
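The compression of atom-level features into structure-level ones via progressive cumulants can likewise be sketched in a few lines. The snippet below, a minimal illustration under the assumption that per-atom features arrive as an (n_atoms, n_dims) array, concatenates the mean with higher central moments over the atoms of a structure; the exact cumulant definitions and normalization used in the paper may differ.

```python
import numpy as np

def structure_features(atom_features: np.ndarray, n_cumulants: int = 3) -> np.ndarray:
    """Compress per-atom latent features of shape (n_atoms, n_dims) into a
    single structure-level vector by concatenating progressive cumulants
    taken over the atoms. The first block is the mean; higher blocks
    capture the variability of the atomic environments within the system."""
    mean = atom_features.mean(axis=0)
    centered = atom_features - mean
    blocks = [mean]
    if n_cumulants >= 2:
        blocks.append((centered ** 2).mean(axis=0))  # 2nd central moment (variance)
    if n_cumulants >= 3:
        blocks.append((centered ** 3).mean(axis=0))  # 3rd central moment (skew-like)
    return np.concatenate(blocks)

# Example: 12 atoms with 64-dimensional latent features -> 192-dimensional vector
X_atoms = np.random.default_rng(0).normal(size=(12, 64))
print(structure_features(X_atoms).shape)  # (192,)
```

Each appended cumulant block preserves the feature dimensionality, so truncating the concatenation offers a simple trade-off between compactness and how much of the within-structure variability is retained.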