Endowing Molecular Language with Geometry Perception via Modality Compensation for High-Throughput Quantum Hamiltonian Prediction

The quantum Hamiltonian is a fundamental property that governs a molecule's electronic structure and behavior, and its calculation and prediction are central to computational chemistry and materials science. Accurate prediction relies heavily on extensive training data, including precise molecular geometries and Hamiltonian matrices, both of which are expensive to acquire experimentally or computationally. Toward fast yet accurate Hamiltonian prediction, we first introduce a geometry-aware molecular language model that bypasses expensive molecular geometries by using only a readily available molecular language, the simplified molecular input line entry system (SMILES). Our method employs multimodal alignment to relate SMILES strings to their corresponding molecular geometries. Because molecular language inherently lacks explicit geometric information, we propose a geometry modality compensation strategy that imbues molecular language representations with essential geometric features, enabling accurate predictions from SMILES alone. In addition, given the high cost of acquiring Hamiltonian data, we devise a weakly supervised strategy for fine-tuning the molecular language model, which improves data efficiency. Theoretically, we prove that the generalization error of prediction without explicit molecular geometry can be bounded under our modality compensation scheme. Empirically, our method achieves up to a 100x speedup over conventional quantum mechanical methods while maintaining comparable prediction accuracy. We further demonstrate the practical utility of our approach through a case study on screening electrolyte formulations.
