Recent advancements in large language models have opened new possibilities for generative molecular drug design. We present Chemlactica and Chemma, two language models fine-tuned on a novel corpus of 110M molecules with computed properties, totaling 40B tokens. These models demonstrate strong performance in generating molecules with specified properties and in predicting new molecular characteristics from limited samples. We introduce a novel optimization algorithm that leverages our language models to optimize molecules for arbitrary properties given limited access to a black-box oracle. Our approach combines ideas from genetic algorithms, rejection sampling, and prompt optimization. It achieves state-of-the-art performance on multiple molecular optimization benchmarks, including an 8% improvement on Practical Molecular Optimization compared to previous methods. We publicly release the training corpus, the language models, and the optimization algorithm.
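The optimization loop described above can be illustrated with a minimal sketch. This is not the authors' released algorithm: the `oracle` (a toy atom-counting score standing in for a limited-access black-box property oracle) and `propose` (a random mutation standing in for language-model generation conditioned on prompts built from high-scoring parents) are both hypothetical placeholders; only the overall structure — genetic-algorithm selection from a pool, rejection of candidates that do not beat the current worst, and iterating under a fixed oracle-call budget — follows the combination of ideas named in the abstract.

```python
import random

random.seed(0)

def oracle(mol: str) -> float:
    # Toy black-box oracle (placeholder): rewards strings with more 'C'
    # characters; a real oracle would score an actual molecular property.
    return mol.count("C") / max(len(mol), 1)

def propose(parent: str) -> str:
    # Placeholder for LM generation: a real system would prompt the
    # language model with high-scoring parents; here, a random one-character
    # mutation over a toy alphabet.
    alphabet = "CNOS"
    i = random.randrange(len(parent))
    return parent[:i] + random.choice(alphabet) + parent[i + 1:]

def optimize(seed_pool, budget=200, pool_size=8):
    # Maintain a pool of the best candidates seen so far (GA-style).
    pool = sorted(seed_pool, key=oracle, reverse=True)[:pool_size]
    for _ in range(budget):
        parent = random.choice(pool)        # genetic-algorithm selection
        child = propose(parent)
        score = oracle(child)
        # Rejection step: keep the child only if it beats the worst
        # member of the current pool.
        if score > oracle(pool[-1]):
            pool.append(child)
            pool = sorted(pool, key=oracle, reverse=True)[:pool_size]
    best = pool[0]
    return best, oracle(best)

best, score = optimize(["NNOS", "SOSN"])
print(best, round(score, 2))
```

In the actual method, oracle scores would be cached and the call budget strictly enforced, since the abstract's setting assumes only limited access to the oracle; this sketch omits that bookkeeping for brevity.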