Chemical Language Models (CLMs) pre-trained on large-scale molecular data are widely used for molecular property prediction. However, the common belief that increasing training resources, such as model size, dataset size, and training compute, improves both pretraining loss and downstream task performance has not been systematically validated in the chemical domain. In this work, we test this assumption by pretraining CLMs at increasing resource scales and measuring transfer performance across diverse molecular property prediction (MPP) tasks. We find that while pretraining loss consistently decreases with increased training resources, downstream task performance shows limited improvement. Moreover, alternative metrics based on the Hessian or the loss landscape also fail to predict downstream performance in CLMs. We further identify conditions under which downstream performance saturates or degrades despite continued improvements in pretraining metrics, and analyze the underlying task-dependent failure modes through parameter-space visualizations. These results expose a gap between pretraining-based evaluation and downstream performance, and underscore the need for model selection and evaluation strategies that explicitly account for downstream task characteristics.