Chemistry plays a crucial role in many domains, such as drug discovery and materials science. While large language models (LLMs) such as GPT-4 exhibit remarkable capabilities on natural language processing tasks, existing work shows that their performance on chemistry tasks is discouragingly low. In this paper, however, we demonstrate that the LLMs we develop achieve very strong results on a comprehensive set of chemistry tasks, outperforming the most advanced GPT-4 on all tasks by a substantial margin and approaching state-of-the-art (SoTA) task-specific models. The key to our success is a large-scale, comprehensive, and high-quality instruction-tuning dataset named SMolInstruct. It contains 14 meticulously selected chemistry tasks and over three million high-quality samples, laying a solid foundation for training and evaluating LLMs for chemistry. Based on SMolInstruct, we fine-tune a set of open-source LLMs, among which we find that Mistral serves as the best base model for chemistry tasks. We further analyze the impact of trainable parameters, providing insights for future research.
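For concreteness, a single instruction-tuning sample pairs a natural-language instruction with a molecule representation and a target answer. The sketch below shows one hypothetical sample; the field names, task tag, and schema are illustrative assumptions, not the released SMolInstruct format.

```python
# Hypothetical shape of one instruction-tuning sample; the field names and
# the task shown (SMILES-to-IUPAC name conversion) are illustrative
# assumptions, not the released SMolInstruct schema.
sample = {
    "task": "name_conversion-s2i",  # one of the 14 chemistry tasks (assumed tag)
    "instruction": "Translate the following SMILES string into its IUPAC name.",
    "input": "CC(=O)Oc1ccccc1C(=O)O",   # aspirin, written as SMILES
    "output": "2-(acetyloxy)benzoic acid",  # aspirin's IUPAC name
}
```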
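Since the abstract mentions fine-tuning open-source base models and analyzing the impact of trainable parameters, the following is a minimal sketch of parameter-efficient instruction tuning with LoRA adapters, assuming the Hugging Face `transformers` and `peft` stack. The checkpoint name, rank, scaling factor, and target modules are assumptions for illustration, not the paper's exact configuration.

```python
# Hedged sketch: wrap an open-source base model (Mistral, as named in the
# abstract) with LoRA adapters via Hugging Face `peft`, then inspect the
# trainable-parameter count. Hyperparameters below are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"  # the specific checkpoint is an assumption
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    r=16,                                  # adapter rank (illustrative)
    lora_alpha=32,                         # scaling factor (illustrative)
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

# Reports counts such as "trainable params: ... || all params: ...", which is
# the quantity one would vary to study the impact of trainable parameters.
model.print_trainable_parameters()
```

Varying `r` (and which modules receive adapters) changes the trainable-parameter budget, which is one natural way to run the kind of trainable-parameter analysis the abstract describes.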