This study introduces a systematic framework for comparing how effectively Large Language Models (LLMs) can be fine-tuned across various cheminformatics tasks. Employing a uniform training methodology, we assessed three well-known models (RoBERTa, BART, and LLaMA) on their ability to predict molecular properties using the Simplified Molecular Input Line Entry System (SMILES) as a universal molecular representation format. Our comparative analysis involved pre-training 18 configurations of these models, with varying parameter sizes and dataset scales, followed by fine-tuning them on six benchmarking tasks from DeepChem. We maintained consistent training environments across models to ensure reliable comparisons. This approach allowed us to assess the influence of model type, model size, and training dataset size on model performance. Specifically, we found that LLaMA-based models generally achieved the lowest validation loss, suggesting superior adaptability across tasks and scales. However, we observed that absolute validation loss is not a definitive indicator of model performance, at least for fine-tuning tasks, a finding that contradicts previous research; instead, model size plays a crucial role. Through rigorous replication and validation, involving multiple training and fine-tuning cycles, our study not only delineates the strengths and limitations of each model type but also provides a robust methodology for selecting the most suitable LLM for specific cheminformatics applications. This research underscores the importance of considering model architecture and dataset characteristics when deploying AI for molecular property prediction, paving the way for more informed and effective use of AI in drug discovery and related fields.