Generative pre-trained Transformer (GPT) has demonstrated great success in natural language processing, and related techniques have been adapted to molecular modeling. Considering that text is the most important record of scientific discovery, in this paper we propose MolXPT, a unified language model of text and molecules pre-trained on SMILES (a sequence representation of molecules) wrapped by text. Briefly, we detect the molecule names in each sequence and replace them with the corresponding SMILES. In this way, the SMILES can leverage information from the surrounding text, and vice versa. The resulting wrapped sequences, text sequences from PubMed, and SMILES sequences from PubChem are all fed into a language model for pre-training. Experimental results demonstrate that MolXPT outperforms strong baselines on molecular property prediction on MoleculeNet, performs comparably to the best model in text-molecule translation while using less than half of its parameters, and enables zero-shot molecular generation without finetuning.
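The wrapping step described above (detect molecule names, then replace them with SMILES) can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not say how names are detected or linked, so the sketch assumes a small hand-written lookup table (`NAME_TO_SMILES`) in place of a real entity recognizer and database, and the boundary markers `<som>` and `<eom>` as well as the function name `wrap_molecules` are hypothetical choices made for illustration.

```python
import re

# Hypothetical name-to-SMILES table (keys must be lowercase). A real
# pipeline would use an entity recognizer plus a chemical database.
NAME_TO_SMILES = {
    "aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "ethanol": "CCO",
}

# Hypothetical markers that delimit a SMILES span inside the text.
START, END = "<som>", "<eom>"


def wrap_molecules(text, table=NAME_TO_SMILES):
    """Replace each known molecule name in `text` with its marked-up SMILES."""
    if not table:
        return text
    # Try longer names first so a long name wins over a shorter one
    # that happens to be its prefix.
    names = sorted(table, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )

    def substitute(match):
        # Look the matched name up case-insensitively.
        smiles = table[match.group(0).lower()]
        return f"{START} {smiles} {END}"

    return pattern.sub(substitute, text)


if __name__ == "__main__":
    print(wrap_molecules("Aspirin is soluble in ethanol."))
    # <som> CC(=O)OC1=CC=CC=C1C(=O)O <eom> is soluble in <som> CCO <eom>.
```

In the output, each SMILES string sits inside ordinary prose, which is what would let a language model condition the molecule on its surrounding text and the text on the molecule. Explicit boundary markers are one simple way to keep the two vocabularies separable at tokenization time.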