With the emergence of Transformer architectures and their powerful understanding of textual data, a new horizon has opened up for predicting molecular properties from text descriptions. While SMILES strings are the most common form of molecular representation, they lack robustness, rich information, and canonicity, which limits their effectiveness as generalizable representations. Here, we present GPT-MolBERTa, a self-supervised large language model (LLM) that uses detailed textual descriptions of molecules to predict their properties. Text-based descriptions of 326,000 molecules were collected using ChatGPT and used to train the LLM to learn molecular representations. For property prediction on the downstream tasks, both BERT and RoBERTa models were used in the finetuning stage. Experiments show that GPT-MolBERTa performs well on various molecular property benchmarks and approaches state-of-the-art performance on regression tasks. Additionally, further analysis of the attention mechanisms shows that GPT-MolBERTa is able to pick up important information from the input textual data, demonstrating the interpretability of the model.