Chemistry plays a crucial role in many domains, such as drug discovery and materials science. While large language models (LLMs) such as GPT-4 exhibit remarkable capabilities on natural language processing tasks, existing work shows that their performance on chemistry tasks is discouragingly low. In this paper, however, we demonstrate that our developed LLMs can achieve very strong results on a comprehensive set of chemistry tasks, outperforming the most advanced GPT-4 across all the tasks by a substantial margin and approaching the state-of-the-art (SoTA) task-specific models. The key to our success is a large-scale, comprehensive, high-quality dataset for instruction tuning named SMolInstruct. It contains 14 meticulously selected chemistry tasks and over three million high-quality samples, laying a solid foundation for training and evaluating LLMs for chemistry. Based on SMolInstruct, we fine-tune a set of open-source LLMs, among which we find that Mistral serves as the best base model for chemistry tasks. We further analyze the impact of trainable parameters, providing insights for future research.