Large language models (LLMs) hold promise as a fast and accurate material modeling paradigm for evaluation, analysis, and design. Their vast number of trainable parameters, however, demands a wealth of data to achieve accuracy and mitigate overfitting, while experimental measurements are often too limited and costly to obtain in the quantities needed for finetuning. To this end, we present a physics-based training pipeline that tackles the pathology of data scarcity. The core enabler is a physics-based modeling framework that generates a multitude of synthetic data, aligning the LLM to a physically consistent initial state before finetuning. Our framework features a two-phase training strategy: (1) supervised pretraining on abundant but less accurate synthetic data, and (2) finetuning the phase-1 model on the limited experimental data. Through the lens of learning polymer flammability metrics, where cone calorimeter data are sparse, we empirically demonstrate that supervised pretraining is vital to obtaining accurate finetuned LLMs.
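The two-phase strategy above can be illustrated with a deliberately minimal toy: a linear model stands in for the LLM, a biased generator stands in for the physics-based synthetic data, and gradient descent stands in for both training phases. All names, sizes, and hyperparameters here are hypothetical choices for illustration, not the paper's actual pipeline.

```python
import numpy as np

def train(w, X, y, lr, epochs):
    """Gradient descent on mean-squared error for a linear model (toy stand-in for LLM training)."""
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])  # hypothetical ground-truth material relationship

# Phase 1 data: abundant but systematically biased synthetic samples
# (mimicking a less accurate physics-based surrogate model).
X_syn = rng.normal(size=(500, 2))
y_syn = X_syn @ (true_w + 0.3) + rng.normal(scale=0.1, size=500)

# Phase 2 data: scarce, higher-fidelity "experimental" measurements.
X_exp = rng.normal(size=(10, 2))
y_exp = X_exp @ true_w + rng.normal(scale=0.05, size=10)

# Phase 1: supervised pretraining on synthetic data.
w_pre = train(np.zeros(2), X_syn, y_syn, lr=0.1, epochs=200)

# Phase 2: finetune the pretrained model on the limited experimental data.
w_two_phase = train(w_pre.copy(), X_exp, y_exp, lr=0.05, epochs=50)

# Baseline: finetune from scratch on the same scarce data, no pretraining.
w_scratch = train(np.zeros(2), X_exp, y_exp, lr=0.05, epochs=50)

err_two_phase = np.linalg.norm(w_two_phase - true_w)
err_scratch = np.linalg.norm(w_scratch - true_w)
print(f"two-phase error: {err_two_phase:.3f}, from-scratch error: {err_scratch:.3f}")
```

Pretraining places the parameters near a physically consistent (if biased) optimum, so the scarce-data phase starts from a far better initial state than a random or zero initialization, which is the mechanism the abstract argues for at LLM scale.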