Large Language Models (LLMs), with their remarkable task-handling capabilities and innovative outputs, have catalyzed significant advancements across a spectrum of fields. However, their proficiency within specialized domains such as biomolecular studies remains limited. To address this challenge, we introduce Mol-Instructions, a meticulously curated, comprehensive instruction dataset designed expressly for the biomolecular domain. Mol-Instructions comprises three components: molecule-oriented instructions, protein-oriented instructions, and biomolecular text instructions, each curated to enhance LLMs' understanding and prediction of biomolecular features and behaviors. Through extensive instruction-tuning experiments on a representative LLM, we show that Mol-Instructions can enhance the adaptability and cognitive acuity of large models in the complex field of biomolecular studies, thereby promoting advancements in the biomolecular research community. Mol-Instructions is publicly available for future research and will be continually updated to broaden its applicability.