Predicting physicochemical properties across chemical space is vital for chemical engineering, drug discovery, and materials science. Current molecular foundation models lack thermodynamic consistency, while domain-informed approaches are limited to single properties and small datasets. We introduce MultiPUFFIN, a domain-constrained multimodal foundation model addressing both limitations simultaneously. MultiPUFFIN features: (i) an encoder fusing SMILES, graphs, and 3D geometries via gated cross-modal attention, alongside experimental condition and descriptor encoders; (ii) prediction heads embedding established correlations (e.g., the Wagner, Andrade, van't Hoff, and Shomate equations) as inductive biases to ensure thermodynamic consistency; and (iii) a two-stage multi-task training strategy. Extending prior frameworks, MultiPUFFIN predicts nine thermophysical properties simultaneously. It is trained on a multi-source dataset of 37,968 unique molecules (40,904 rows). With roughly 35 million parameters, MultiPUFFIN achieves a mean $R^2 = 0.716$ on a challenging scaffold-split test set of 8,877 molecules. Compared to ChemBERTa-2 (pre-trained on 77 million molecules), MultiPUFFIN outperforms the fine-tuned baseline across all nine properties despite using 2,000x fewer training molecules. Advantages are strikingly apparent for temperature-dependent properties, where ChemBERTa-2 lacks the architectural capacity to incorporate thermodynamic conditions. These results demonstrate that multimodal encoding and domain-informed biases substantially reduce data and compute requirements compared to brute-force pre-training. Furthermore, MultiPUFFIN handles missing modalities and recovers meaningful thermodynamic parameters without explicit supervision. Systematic ablation studies confirm the property-specific benefits of these domain-informed prediction heads.
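To make the idea of domain-informed prediction heads concrete, here is a minimal, framework-agnostic sketch of the underlying principle (not the paper's actual implementation): the network predicts molecule-specific correlation parameters, and the head evaluates an established thermodynamic correlation at the query temperature, so the predicted temperature dependence is consistent by construction. Function names and the two-parameter forms are illustrative assumptions.

```python
import math

def andrade_log_viscosity(A: float, B: float, T: float) -> float:
    """Andrade correlation: ln(eta) = A + B / T.

    A, B are molecule-specific parameters a network would predict
    from the molecular embedding; T is the query temperature in K.
    """
    return A + B / T

def vant_hoff_log_psat(A: float, B: float, T: float) -> float:
    """Two-parameter van't Hoff / Clausius-Clapeyron form:
    ln(p_sat) = A - B / T, with B proportional to the enthalpy
    of vaporization. Again, A and B come from the network.
    """
    return A - B / T

# Because the head hard-codes the functional form, physically
# expected trends hold for any predicted (A, B) with B > 0:
# viscosity falls and vapor pressure rises with temperature.
eta_300 = math.exp(andrade_log_viscosity(-3.0, 1200.0, 300.0))
eta_350 = math.exp(andrade_log_viscosity(-3.0, 1200.0, 350.0))
p_300 = math.exp(vant_hoff_log_psat(10.0, 3000.0, 300.0))
p_350 = math.exp(vant_hoff_log_psat(10.0, 3000.0, 350.0))
```

A plain regression head, by contrast, would have to learn these monotonic temperature trends from data, which is exactly where fine-tuned ChemBERTa-2 falls short in the comparison above.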