Molecular property prediction has gained significant attention due to its transformative potential in multiple scientific disciplines. Conventionally, a molecule can be represented either as graph-structured data or as a SMILES string. Recently, the rapid development of Large Language Models (LLMs) has revolutionized the field of NLP. Although it is natural to use LLMs to help understand molecules represented as SMILES, the exploration of how LLMs will impact molecular property prediction is still at an early stage. In this work, we advance towards this objective from two perspectives: zero/few-shot molecular classification, and using the explanations generated by LLMs as new representations of molecules. Specifically, we first prompt LLMs to perform in-context molecular classification and evaluate their performance. We then employ LLMs to generate semantically enriched explanations for the original SMILES and use these explanations to fine-tune a small-scale LM for multiple downstream tasks. The experimental results highlight the superiority of text explanations as molecular representations across multiple benchmark datasets and confirm the immense potential of LLMs in molecular property prediction tasks. Code is available at \url{https://github.com/ChnQ/LLM4Mol}.
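As a rough illustration of the two uses of LLMs described in the abstract, the sketch below builds (a) a zero/few-shot classification prompt from SMILES strings and (b) a prompt asking for a text explanation of a molecule. It is a minimal sketch under stated assumptions, not the authors' implementation: the prompt wording, the names `LabeledMolecule`, `build_classification_prompt`, `build_explanation_prompt`, and `parse_yes_no`, and the property name `example_property` are all hypothetical, and the labels in the demo are placeholders rather than real measurements. No LLM is called; the strings would be passed to whichever model is in use.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class LabeledMolecule:
    """A SMILES string with a binary label (hypothetical container)."""
    smiles: str
    label: bool


def build_classification_prompt(
    property_name: str,
    query_smiles: str,
    examples: Sequence[LabeledMolecule] = (),
) -> str:
    """Build a zero-shot (no examples) or few-shot prompt for one molecule.

    The wording is an assumption made for illustration only.
    """
    lines: List[str] = [
        f"Task: decide whether the molecule has the property '{property_name}'.",
        "Answer with 'Yes' or 'No'.",
        "",
    ]
    # In-context demonstrations: each labeled molecule becomes one Q/A pair.
    for ex in examples:
        lines.append(f"SMILES: {ex.smiles}")
        lines.append(f"Answer: {'Yes' if ex.label else 'No'}")
        lines.append("")
    # The query molecule comes last, with the answer left open for the model.
    lines.append(f"SMILES: {query_smiles}")
    lines.append("Answer:")
    return "\n".join(lines)


def build_explanation_prompt(smiles: str) -> str:
    """Ask for a text explanation that could later serve as model input."""
    return (
        "Describe the functional groups, notable substructures, and likely "
        f"chemical properties of the molecule with SMILES: {smiles}"
    )


def parse_yes_no(response: str) -> Optional[bool]:
    """Map a free-text model reply to a label; None if it is unparseable."""
    text = response.strip().lower()
    if text.startswith("yes"):
        return True
    if text.startswith("no"):
        return False
    return None


if __name__ == "__main__":
    # Placeholder labels, chosen only to show the prompt layout.
    shots = [
        LabeledMolecule("c1ccccc1", True),
        LabeledMolecule("CC(=O)O", False),
    ]
    print(build_classification_prompt("example_property", "CCO", shots))
    print()
    print(build_explanation_prompt("CCO"))
```

Under this sketch, the explanation text returned by the LLM for each molecule would replace or accompany the raw SMILES as the input to a smaller language model fine-tuned on the downstream labels; that fine-tuning step is omitted here because the abstract does not specify the model or the training setup.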